ABSTRACT

As data are increasingly modeled as graphs to express complex
relationships, the tree pattern query on graph-structured data
has become an important type of query in real-world applications.
Most practical query languages, such as XQuery and SPARQL,
support logical expressions using logical AND/OR/NOT operators
to define structural constraints of tree patterns. In this paper, (1)
we propose generalized tree pattern queries (GTPQs) over
graph-structured data, which fully support propositional logic of
structural constraints. (2) We make a thorough study of fundamental
problems including satisfiability, containment and minimization,
and analyze the computational complexity and the decision
procedures of these problems. (3) We propose a compact graph
representation of intermediate results and a pruning approach to reduce
the size of intermediate results and the number of join operations,
two factors that often impair the efficiency of traditional algorithms
for evaluating tree pattern queries. (4) We present an efficient
algorithm for evaluating GTPQs using 3-hop as the underlying
reachability index. (5) Experiments on both real-life and synthetic data
sets demonstrate the effectiveness and efficiency of our algorithm,
which is from several times to orders of magnitude faster than
state-of-the-art algorithms in terms of evaluation time, even for
traditional tree pattern queries with only conjunctive operations.

1. INTRODUCTION 

Graphs are among the most ubiquitous data models in many
areas, such as social networks, the Semantic Web and biological
networks. As the most common tool for data transmission, XML
documents are better modeled as graphs than as trees, so that
flexible data structures can be represented by incorporating the
concept of ID/IDREFs. Semantic Web data are also modeled as graphs,
e.g. in RDF/RDFS. On graph data, tree pattern queries (TPQs) are an
important class of queries of practical interest. In query languages
such as XQuery and SPARQL, many queries can be regarded as TPQs
over graphs. As most of them support logical operations including
conjunction ($\wedge$), disjunction ($\vee$) and negation ($\neg$) in
the query conditions, it is necessary to study TPQs over graphs with
multiple logical predicates, as illustrated in the following example.

Example 1. A DBLP XML document separately stores inproceedings
records for papers and proceedings records for volumes, linked
by crossref elements indicating where a paper is published [24].
The underlying data structure is clearly a graph. Consider the
following three queries, which ask for information about publications
for which a certain tree pattern of data holds.

$Q_1$: Retrieve the information about Alice's conference papers that are
published from 2000 to 2010 and co-authored with Bob.

$Q_2$: Retrieve the information about the conference papers of either Alice
or Bob published from 2000 to 2010.




[Figure 1 query nodes: //inproceedings, /author="Alice", /author="Bob",
/year, /title, //proceedings, /title, /year $\in$ [2000, 2010]]

Figure 1: The tree representation of $Q_1$, $Q_2$, and $Q_3$ in
Example 1. Document elements matching the starred query nodes
are required to be returned, and the single-/double-lined edges
denote the parent-child/ancestor-descendant relationships
between elements.



$Q_3$: Retrieve the information about Alice's conference papers that are not
co-authored with Bob and published from 2000 to 2010.

They can be expressed in XQuery and are essentially TPQs on
graph-structured data (see the Appendix), but $Q_2$ and $Q_3$ cannot
be expressed as traditional TPQs, which only contain conjunctive
predicates. Indeed, they share the same tree representation, depicted
in Fig. 1, but different structural predicates must be imposed on the
inproceedings element $u_1$. For example, in $Q_1$, each embedding of
the pattern should satisfy all paths specified in the query; but for
$Q_2$, the two path conditions "$u_1$-$u_2$" and "$u_1$-$u_3$" are not
required to be satisfied simultaneously. A predicate that specifies
those edge constraints and incorporates disjunction and negation needs
to be attached to each query node in order to express $Q_2$ and $Q_3$.
In general, (1) it is common in practice that logical expressions on
query nodes need to be imposed to specify complex relationships for not
only attribute predicates (e.g. $2000 \le year \le 2010$) but also
structural constraints (e.g. ($u_1$-$u_2$ or $u_1$-$u_3$) in $Q_2$ and
not($u_1$-$u_3$) in $Q_3$); (2) some of the nodes (e.g. $u_i$
($i \in \{1, 2, 3, 6, 8\}$)) in the query pattern only serve as filters
for pruning unexpected results, which means that the results of
a TPQ should consist of matches for a portion of the query nodes
only. □

Although TPQs have been widely studied for many years, few 
of the proposed processing algorithms can be used to efficiently 
evaluate such queries over general graphs. They can neither support 
disjunction and negation on structural constraints nor be optimized 
for the situation where output nodes take only a portion of query 
nodes (see Related work for details). 

Contributions & Roadmap. This work makes the first effort to
deal with TPQs over general graph-structured data with Boolean
logic support. The contributions are summarized as follows.
(1) We introduce a new class of tree pattern queries over
graph-structured data, called generalized tree pattern queries (GTPQs)
(Section 2). In a GTPQ, a node is associated not only with an
attribute predicate, which specifies the property conditions, but also
with a structural predicate in terms of propositional logic, with
logical connectives including conjunction, negation and disjunction, to
specify structural conditions with respect to its descendants. The
query allows a portion of the query nodes to be output nodes. We also
show that our formalization of queries has advantages over those in the
literature on queries against tree-structured data.

(2) We investigate fundamental problems for GTPQs, including
satisfiability, containment, equivalence and minimization (Section 3).
We show that the satisfiability of a special GTPQ with only
conjunction and disjunction is solvable in linear time, but
satisfiability and the other three problems become computationally
intractable when negation is incorporated. We propose an exact
algorithm to minimize GTPQs, which we expect to be sufficiently
efficient, since query sizes are typically small in practice.

(3) We propose a graph representation of intermediate results and
a pruning approach to address notable problems in evaluating query
patterns over graphs, and develop an algorithm for GTPQs with
ancestor-descendant edges together with its extension to deal with
parent-child edges (Section 4). The algorithm can filter out most
nodes that cannot contribute to the final results, avoid generating
redundant intermediate results, and compactly represent the matches.

(4) We implement our algorithm and conduct an experimental study
using synthetic and real-life data (Section 5). We find that our
evaluation algorithm performs significantly better than
state-of-the-art algorithms even for conjunctive TPQs. It also has
better scalability and is robust for different queries on different
graphs. The experiments also demonstrate the effectiveness of the
graph representation of results and the efficiency of the pruning
method.

Related work. There is a large body of research work on TPQs
over tree-structured data (see [14] for a survey). However, all of
these studies rely heavily on the relatively simple structure of trees
and employ node encoding schemes (including the interval [3],
Dewey [21] and sequence [28] encodings) that are not applicable to
graphs for determining structural relationships. Techniques critical
for their efficiency, such as stack encoding and node skipping, can
only be applied to tree-structured data. Some sparse graph data, such
as many XML documents with ID/IDREFs, can be modeled as disjoint trees
connected by edges. One can apply the existing algorithms for
tree-structured data to such graphs by first decomposing a query into
several TPQs over different trees and then merging the results of the
distinct queries to form the final results, but this is inefficient due
to large redundant intermediate results and costly merging processes.

Some studies extended the traditional TPQs by incorporating
additional functions and restrictions. Chen et al. [9] added optional
nodes to patterns and investigated efficient evaluation plans upon
native XML database systems. Their generalized tree pattern is still
evaluated against tree-structured data, whereas this work studies
TPQs over graph-structured data with logical predicates. Jiang
et al. [16] proposed new holistic algorithms based on a concept of
OR-blocks to process AND/OR-twigs, i.e., TPQs with OR-predicates.
At the end of Section 2, we shall show that (1) our query size is
never larger than the number of element nodes of an AND/OR-twig
expressing a semantically identical query; (2) constructing
OR-blocks involves converting a propositional formula to conjunctive
normal form, thus taking exponential time in the worst case;
(3) the proposed algorithms only support tree-structured data as
input. [17] studied path queries with negation, while [29] and [20]
added negation to TPQs. They cannot be applied to GTPQs either,
since they are based on the classical holistic twig join algorithm [3]
that only works on tree-structured data.

There has been work on pattern queries for graph-structured data.
TwigStackD [6] generalized the holistic algorithms, but it takes
considerable time and space without a pre-filtering process [30].
HGJoin [27] can evaluate general graph pattern queries using
OPT-tree-cover [1] as the underlying reachability indexing approach. It
decomposes a pattern into a set of complete bipartite graphs and
generates matches for them in order according to a plan. The time
cost of plan generation is always exponential, since it has to
produce a state graph with an exponential number of nodes whether it
seeks an optimal or a suboptimal plan. Cheng et al. [11] proposed
R-join/R-semijoin processing for the graph pattern matching problem.
It relies on a cluster-based R-join index whose size is typically
prohibitively large, as the index stores matches for every two labels
derived from 2-hop indexing [12]. Unlike the plan generation of
HGJoin, it adopts left-join to reduce the cost, but in the worst case
the time complexity is still exponential. Since both HGJoin and
R-join/R-semijoin use structural joins similar to the earlier work on
tree-structured data, they typically have large intermediate results
and need to perform large numbers of expensive join operations.
None of these three algorithms directly supports queries with
negative/disjunctive predicates. A straightforward approach to
applying them to GTPQ processing is to decompose the query into
multiple conjunctive TPQs and perform difference and merge
operations on the results of the decomposed queries. However, the
number of resulting conjunctive TPQs may be exponential, and
large intermediate results may need to be generated and merged.

A number of studies investigated various graph pattern matching
problems [13, 15, 31]. [15] proposed a graph query language
GraphQL and studied graph-specific optimization techniques for
graph pattern matching that combines subgraph isomorphism and
predicate evaluation. While the language is able to express queries
with ancestor-descendant edges and disjunctive predicates, the work
focused on processing non-recursive and conjunctive graph pattern
queries, where all edges of a query pattern correspond to the
parent-child edges of GTPQs, specifying the adjacency relationship
between desired matching nodes. [13] defined matching in terms
of bounded simulation to reduce its computational complexity. [31]
studied distance pattern matching, in which query edges are mapped
to paths of bounded length. The queries of [13] and [31] do not
support negative/disjunctive predicates on edges and have quite
different semantics from ours.

Most existing algorithms find all instances of patterns containing
matches of all query nodes. In real-world applications, however, the
answer to a query often requires matches of only several, not all,
query nodes. Indeed, many query nodes only serve as filters for
imposing structural constraints on output nodes. Our framework can
avoid generating redundant matches at run time.

Satisfiability, containment, equivalence and minimization are
fundamental problems for any query language. The minimization of
TPQs over tree-structured data has been investigated in several
papers. Amer-Yahia et al. [2] proposed algorithms for minimization
with and without integrity constraints. Ramanan [23] studied
this problem for TPQs defined by graph simulation. Chen et al. [5]
used a richer class of integrity constraints for query minimization
of TPQs with a unique output node. However, we are not aware of
previous work on minimization, or on the other three problems,
for TPQs with logical predicates over either tree-structured or
graph-structured data.

2. DATA MODEL AND GENERALIZED 
TREE PATTERN QUERIES 

Data graphs. A data graph is a directed graph $G = (V, E, f)$,
where (1) $V$ is a finite set of nodes; (2) $E \subseteq V \times V$ is
a finite set of edges, in which each pair $(v, v')$ denotes an edge
from $v$ to $v'$; (3) $f$ is a function on $V$ defining attribute
values associated with nodes. For each node $v \in V$, $f(v)$ is a
tuple $(A_1 = a_1, \ldots, A_n = a_n)$, where the expression
$A_i = a_i$ ($i \in [1, n]$) represents that $v$ has an attribute
denoted by $A_i$ and its value is a constant $a_i$. For example,
in a data graph $G = (V, E, f)$ of a DBLP document, the node
properties in $f$ may include tags, string values, typed values, and
attributes specified in the elements.

Abusing notions for trees and traditional tree pattern queries, we
refer to a node $v_2$ as a child of a node $v_1$ (or $v_1$ as a parent
of $v_2$) and say they have a parent-child (PC) relationship if there
is an edge $(v_1, v_2)$ in $E$, and refer to $v_2$ as a descendant of
$v_1$ (or $v_1$ as an ancestor of $v_2$) and say they have an
ancestor-descendant (AD) relationship if there is a nonempty path from
$v_1$ to $v_2$ in $G$.
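The two relationships can be made concrete with a minimal sketch. It is an illustration only, not the paper's implementation: the class and method names are hypothetical, attribute tuples are stored as dictionaries, and the AD test is a plain depth-first search rather than the 3-hop reachability index the paper relies on.

```python
from collections import defaultdict


class DataGraph:
    """Directed graph G = (V, E, f); names here are illustrative only."""

    def __init__(self):
        self.attrs = {}               # f: node -> {attribute name: value}
        self.succ = defaultdict(set)  # E stored as adjacency sets

    def add_node(self, v, **attributes):
        self.attrs[v] = dict(attributes)

    def add_edge(self, v1, v2):
        self.succ[v1].add(v2)

    def is_child(self, v1, v2):
        """PC relationship: the edge (v1, v2) is in E."""
        return v2 in self.succ[v1]

    def is_descendant(self, v1, v2):
        """AD relationship: a nonempty path leads from v1 to v2."""
        stack, seen = list(self.succ[v1]), set()
        while stack:
            v = stack.pop()
            if v == v2:
                return True
            if v not in seen:
                seen.add(v)
                stack.extend(self.succ[v])
        return False


# A toy fragment shaped like the DBLP example: paper -> crossref -> volume.
g = DataGraph()
g.add_node("paper", tag="inproceedings")
g.add_node("ref", tag="crossref")
g.add_node("volume", tag="proceedings")
g.add_edge("paper", "ref")
g.add_edge("ref", "volume")

print(g.is_child("paper", "volume"))       # not a direct edge
print(g.is_descendant("paper", "volume"))  # reachable through "ref"
```

Because the path must be nonempty, a node is its own descendant only if it lies on a cycle.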

Generalized tree pattern queries. A generalized tree pattern query
(GTPQ) is a tuple $Q = (V_b, V_p, V_o, E_q, f_a, f_e, f_s)$, where:

(1) $V_b$ and $V_p$ are both finite sets of nodes, called backbone
nodes and predicate nodes, respectively. The complete set of query
nodes is denoted as $V_q$, i.e., $V_q = V_b \cup V_p$.

(2) $V_o \subseteq V_b$. The nodes in $V_o$ are called output nodes.

(3) $E_q \subseteq \{(u_1, u_2) \mid u_1, u_2 \in V_b\} \cup
\{(u_1, u_2) \mid u_1 \in V_b \cup V_p, u_2 \in V_p\}$ is a finite set
of edges. Here, $(V_q, E_q)$ is restricted to a directed tree.

(4) $f_a$ is a function defined on $V_q$ such that for each node
$u \in V_q$, $f_a(u)$ is an attribute predicate that is a conjunction
of atomic formulas of the form "$A$ op $a$", in which $A$ is an
attribute name, $a$ is a constant and op is a comparison operator in
$\{<, \le, =, \ne, >, \ge\}$.

(5) $f_e$ is a function on $E_q$ specifying the type of each edge. Each
edge $(u_1, u_2)$ represents either a PC relationship or an AD
relationship.

(6) $f_s$ is a function defined on internal nodes. For each internal
node $u \in V_q$ with $k$ children being predicate nodes, $f_s(u)$,
called a structural predicate, is a propositional formula in $k$
variables $p_{u_1}, \ldots, p_{u_k}$, each corresponding to a tree edge
directed to a predicate child of $u$. In particular, if $u$ has no
predicate children, $f_s(u) = 1$. Each node $u$ is associated with a
distinct propositional variable denoted by $p_u$.

We call a GTPQ a union-conjunctive GTPQ if the structural
predicates on all query nodes are negation-free, and call it a
conjunctive GTPQ if the structural predicates on all the query nodes
only have conjunction connectives.

Before giving the semantics of GTPQs, we add variables for
non-root backbone nodes to extend the structural predicate. For an
internal node $u$ with $k'$ backbone children, denoted by
$u_1, \ldots, u_{k'}$, the extended structural predicate is
$f_{ext}(u) = p_{u_1} \wedge \ldots \wedge p_{u_{k'}} \wedge f_s(u)$.

Example 2. In Example 1, $Q_1 = (V_b, V_p, V_o, E_q, f_a, f_e, f_s)$ is
a conjunctive GTPQ, in which (1) $V_b = \{u_1, u_4, u_5, u_6, u_7\}$,
$V_p = \{u_2, u_3, u_8\}$, $V_o = \{u_4, u_5, u_7\}$; (2) the attribute
predicate $f_a$ for a query node is a conjunction of comparisons among
tags and typed values (e.g. $f_a(u_2) = (tag = \text{"author"} \wedge
value = \text{"Bob"})$); (3) $f_s(u_1) = p_{u_2} \wedge p_{u_3}$, and
$f_s(u_6) = p_{u_8}$. The only difference between $Q_2$ and $Q_1$ is
that in $Q_2$, $f_s(u_1) = p_{u_2} \vee p_{u_3}$. In $Q_3$,
$f_s(u_1) = p_{u_2} \wedge \neg p_{u_3}$. As an example of extended
structural predicates, for $Q_2$, $f_{ext}(u_1) = (p_{u_2} \vee
p_{u_3}) \wedge p_{u_4} \wedge p_{u_5} \wedge p_{u_6}$. □
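The three queries differ only in the connective used in the structural predicate of $u_1$, which a few lines of Python make concrete. This is an illustrative sketch: the function names are hypothetical, and each argument stands for the truth value of the variable of one predicate child of $u_1$.

```python
def fs_q1(p_u2: bool, p_u3: bool) -> bool:
    """Q1: both predicate children of u1 must be matched (conjunction)."""
    return p_u2 and p_u3


def fs_q2(p_u2: bool, p_u3: bool) -> bool:
    """Q2: at least one predicate child must be matched (disjunction)."""
    return p_u2 or p_u3


def fs_q3(p_u2: bool, p_u3: bool) -> bool:
    """Q3: the first child must be matched and the second must not (negation)."""
    return p_u2 and not p_u3


# Truth table over the two variables, one row per assignment.
for p_u2 in (False, True):
    for p_u3 in (False, True):
        print(p_u2, p_u3, fs_q1(p_u2, p_u3), fs_q2(p_u2, p_u3), fs_q3(p_u2, p_u3))
```

Only the first function is expressible in a traditional conjunctive TPQ; the other two need the disjunction and negation that GTPQs add.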

Semantics. Consider a data graph $G = (V, E, f)$ and a GTPQ
$Q = (V_b, V_p, V_o, E_q, f_a, f_e, f_s)$. We say that a data node $v$
in $G$ downwardly matches a query node $u$ in $Q$, denoted by
$v \models u$, if the following conditions are satisfied:

(1) $v$ satisfies the attribute predicate of $u$, denoted by
$v \sim u$. That is, for each formula "$A$ op $a$" in $f_a(u)$, there
is an element $(A = a')$ in $f(v)$ such that $a'$ op $a$. $v$ is called
a candidate matching node of $u$. $mat(u)$ denotes the set of candidate
matching nodes of $u$, i.e., $mat(u) = \{v \mid v \in V, v \sim u\}$.




Figure 2: Example of a data graph and a GTPQ; panel (b) shows the
GTPQ Q on G. We use a rectangle to represent a predicate node and a
circle to represent a backbone node.

Figure 3: Comparison between (a) a B-twig query and (b) a GTPQ

(2) If $u$ is an internal node, the data node $v$ determines a truth
assignment to the variables of $f_{ext}(u)$ such that
$f_{ext}^v(u) = 1$, where $f_{ext}^v(u)$ denotes the truth value of
$f_{ext}(u)$ under the assignment. For each variable $p_{u'}$, the
truth value $p_{u'}^v$ is assigned as follows: for each PC (resp. AD)
child $u'$ of $u$, $p_{u'}^v = 1$ if there exists a child
(resp. descendant) $v'$ of $v$ such that $v' \models u'$; otherwise,
$p_{u'}^v = 0$.

Let $V_b = \{u_1, \ldots, u_m\}$. An $m$-ary tuple
$(v_1, \ldots, v_m)$ of nodes in $G$ is said to be a match of $Q$ on
$G$ if the following conditions hold: (1) for each $v_i$
($i \in [1, m]$), $v_i \models u_i$; (2) for each edge
$(u_i, u_j) \in E_q$ ($i, j \in [1, m]$), if $u_j$ is a PC child of
$u_i$, $v_j$ is a child of $v_i$; otherwise, $v_j$ is a descendant of
$v_i$.

The answer $Q(G)$ to $Q$ is a set of results in the form of tuples,
where each tuple consists of the images of the output nodes $V_o$ in a
match of $Q$. For each match, there is at least one assignment for all
variables that makes the extended structural predicates of all
internal backbone nodes and some of the internal predicate nodes
evaluate to true, which we call a certificate of the match. For a
match and an assignment that is a certificate of the match, an
instance of $Q$ on $G$ is a tuple of nodes, each of which matches a
distinct query node whose corresponding propositional variable is true
under the assignment. In particular, an instance of a conjunctive GTPQ
is exactly a match of the query.
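The downward-matching relation can be evaluated by a direct recursion over the query tree. The sketch below is a naive illustration of the semantics only, not the paper's evaluation algorithm: the toy graph, the dictionary-based query encoding and all function names are assumptions made for the example, and it decides $v \models u$ for single nodes without enumerating match tuples.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       "!=": operator.ne, ">": operator.gt, ">=": operator.ge}

# Hypothetical toy data graph: attribute tuples f(v) and adjacency lists.
ATTRS = {
    "p1": {"tag": "paper"},
    "a1": {"tag": "author", "value": "Alice"},
    "b1": {"tag": "author", "value": "Bob"},
    "p2": {"tag": "paper"},
    "a2": {"tag": "author", "value": "Alice"},
}
EDGES = {"p1": ["a1", "b1"], "p2": ["a2"]}


def children(v):
    return EDGES.get(v, [])


def descendants(v):
    """All nodes reachable from v by a nonempty path."""
    seen, stack = set(), list(children(v))
    while stack:
        w = stack.pop()
        if w not in seen:
            seen.add(w)
            stack.extend(children(w))
    return seen


def satisfies(v, attr_pred):
    """v ~ u: every atom 'A op a' holds for the value of A stored at v."""
    return all(a in ATTRS[v] and OPS[op](ATTRS[v][a], c)
               for a, op, c in attr_pred)


def matches(v, u, query):
    """v |= u: attribute predicate holds and the extended predicate is true."""
    node = query[u]
    if not satisfies(v, node["attrs"]):
        return False
    truth = {}
    for child, edge in node["children"]:
        pool = children(v) if edge == "PC" else descendants(v)
        truth[child] = any(matches(w, child, query) for w in pool)
    # Extended predicate: all backbone children matched, and f_s is true.
    return all(truth[c] for c in node["backbone"]) and node["fs"](truth)


def leaf(attrs):
    return {"attrs": attrs, "children": [], "backbone": [],
            "fs": lambda t: True}


# A query in the spirit of Q3: a paper by Alice that is not co-authored by Bob.
QUERY = {
    "u1": {"attrs": [("tag", "=", "paper")],
           "children": [("u2", "PC"), ("u3", "PC")],
           "backbone": [],
           "fs": lambda t: t["u2"] and not t["u3"]},
    "u2": leaf([("tag", "=", "author"), ("value", "=", "Alice")]),
    "u3": leaf([("tag", "=", "author"), ("value", "=", "Bob")]),
}

print([v for v in ATTRS if matches(v, "u1", QUERY)])  # prints ['p2']
```

The recursion descends the query tree, so it terminates even on cyclic data graphs; it recomputes reachability at every step, which is exactly the cost a reachability index is meant to avoid.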

Example 3. For simplicity of presentation, a lower-case letter $x_i$
in all figures throughout this paper denotes $f(v)$ for a data node $v$,
and a capital letter $Y_j$ denotes $f_a(u)$ for a query node $u$, such
that $v \sim u$ if $j < i$ and $X = Y$.

Consider the data graph and the query shown in Fig. (2] «i3 ~ 
Us, Vis 7^ Us. Accordingly, mat{us,) — {vis} , mat{uio) = 
{v-3,vio,vi3,vis}. The answer (5(G) = {(i;3,«ii), {v3,vi2), (us, 
W14), (w8,t'i2), (w8,i'i4)}. One of the query matches leading to 



{v3,vii) is («i, U3, U3, uii), where elements are sorted in the as- 
cending order of the subscripts of corresponding query nodes. An 
instance of this match is {ui : Vi,U2 : vg,U3 : vg,U4 : Vii,u-7 : 
V(i,us : vii,ug : uis}, where 'u : v' means u is a match of u. 
Indeed, $v_3 \models u_3$, because (1) $v_3 \sim u_3$, and (2)
$f_{ext}^{v_3}(u_3) = 1$ since $v_6 \models u_7$ and
$v_{11} \models u_8$. Also, $v_8 \models u_3$, because $v_8$ cannot
reach a node matching $u_6$ and hence $p_{u_6}^{v_8} = 0$, thereby
$f_{ext}^{v_8}(u_3) = 1$. □

For simplicity of semantics, we require a query to explicitly specify
backbone nodes and predicate nodes and restrict output nodes to
backbone ones. The distinction between the two types of nodes is
that propositional variables associated with backbone nodes, unlike
those associated with predicate nodes, may not be operands of negation
or disjunction, which guarantees that each backbone node has an image
in a match of the query. Permitting negation and disjunction on any
query node leads to issues that are not computationally desirable. If
each query result is still required to have an image for each output
node, the expressive power does not change; but determining whether a
query is valid then amounts to checking whether the variables
associated with output nodes are always true for all certificates of
matches, which is a co-NP-complete problem. Otherwise, the output
structures are no longer fixed. They can either be specifically
defined in the query, or consist of exponentially many combinations of
output nodes by default. Our algorithm described in Section 4 can be
straightforwardly extended to process queries with multiple output
structures (see the Appendix).

We now compare GTPQ with the works in [29] and [4]. [29]
deals with AND/OR-twigs against tree-structured data. [4] further
extends [29] to handle B-twigs, which additionally introduce the
logical-NOT operation into the query. Both represent a query by
defining special types of nodes for operators, namely logical-AND
nodes, logical-OR nodes and logical-NOT nodes. For each occurrence
of a variable in a structural predicate of a GTPQ, the corresponding
AND/OR-twig or B-twig needs to use a distinct subtree to express the
structural constraints with respect to descendants as specified by the
variable, since in AND/OR-twigs and B-twigs, the query nodes connected
to different operator nodes are considered distinct. The query size of
AND/OR-twigs or B-twigs may hence be much larger than the size of a
GTPQ for expressing complex tree patterns. In Fig. 3, the B-twig query
has to use two paths, $u_2$-$u_4$ and $u_5$-$u_6$, to represent the
constraints that can be imposed by a single path $u_2$-$u_5$ in the
semantically equivalent GTPQ. Moreover, before evaluating the query,
[29] and [4] have to construct OR-blocks to normalize the twig. The
normalization process is essentially a CNF conversion of propositional
formulas. Since a CNF conversion can lead to an exponential explosion
of the formula, the time cost of a conversion is exponential in the
size of the original query, and the resulting query size also becomes
exponential in the worst case. Therefore, our query representation is
more powerful and compact than the tree representation of [29] and [4].
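The exponential explosion of CNF conversion is easy to reproduce. As an illustration that is not taken from the paper, the sketch below converts the formula $(a_1 \wedge b_1) \vee \ldots \vee (a_n \wedge b_n)$ to CNF by distributing disjunction over conjunction and counts the resulting clauses; the function name is hypothetical.

```python
from itertools import product


def cnf_by_distribution(terms):
    """Distribute OR over AND for a DNF formula.

    `terms` is a list of conjunctive terms, each a tuple of literals.
    Every CNF clause picks one literal from each term, so the result is
    the set of all such choices, each stored as a frozenset of literals.
    """
    return {frozenset(choice) for choice in product(*terms)}


for n in range(1, 6):
    dnf = [(f"a{i}", f"b{i}") for i in range(1, n + 1)]
    print(n, len(cnf_by_distribution(dnf)))  # the clause count doubles with n
```

With $n$ two-literal terms over distinct variables the conversion yields $2^n$ clauses, while the input has only $2n$ literals.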

3. FUNDAMENTAL PROBLEMS FOR GEN- 
ERALIZED TREE PATTERN QUERIES 

In this section, we study the problems of satisfiability, contain- 
ment, equivalence, and minimization of GTPQs, which are impor- 
tant for query analysis and optimization. 

3.1 Satisfiability 

A GTPQ $Q$ is satisfiable if there is a data graph $G$ on which the
answer $Q(G)$ to $Q$ is nonempty. We first introduce some definitions
before showing how to determine the satisfiability and establishing
the property of the problem.

We say $u$ is an independently constraint node if (1) the formula
$(f_s(u')[p_u/1] \oplus f_s(u')[p_u/0]) \wedge f_s(u)$ is satisfiable,
in which $u'$ is the parent of $u$, $f_s(u')[p_u/x]$ is the formula
produced by assigning $x$ to the variable $p_u$ ($x \in \{0, 1\}$), and
$\oplus$ is the exclusive-or logical operator; (2) all ancestors of $u$
are independently constraint nodes. Intuitively, the variables of
independently constraint nodes can independently affect the resulting
truth value of the structural predicates of their parents and
ancestors. Backbone nodes are clearly independently constraint nodes
if their structural predicates are satisfiable.
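Condition (1) can be checked by brute force on small predicates. The sketch below is an illustration under stated assumptions, not the paper's procedure: formulas are Python callables over an assignment dictionary, all names are hypothetical, and condition (2) on the ancestors is deliberately left out.

```python
from itertools import product


def satisfiable(formula, variables):
    """Brute-force SAT: try every assignment (exponential, fine for tiny inputs)."""
    return any(formula(dict(zip(variables, bits)))
               for bits in product((False, True), repeat=len(variables)))


def condition_one(fs_parent, parent_vars, p_u, fs_u, u_vars):
    """Is (fs(u')[p_u/1] XOR fs(u')[p_u/0]) AND fs(u) satisfiable?"""
    variables = [x for x in parent_vars if x != p_u] + list(u_vars)

    def formula(a):
        # XOR of the two substituted formulas: does flipping p_u change fs(u')?
        flips = fs_parent({**a, p_u: True}) != fs_parent({**a, p_u: False})
        return flips and fs_u(a)

    return satisfiable(formula, variables)


# Hypothetical parent predicate (p5 AND p6) OR (NOT p5 AND p6), which is
# logically equivalent to p6, so p5 can never influence its value.
def fs_parent(a):
    return (a["p5"] and a["p6"]) or (not a["p5"] and a["p6"])


def always_true(a):
    # Predicate of a node without predicate children: fs(u) = 1.
    return True


print(condition_one(fs_parent, ["p5", "p6"], "p5", always_true, []))  # False
print(condition_one(fs_parent, ["p5", "p6"], "p6", always_true, []))  # True
```

A variable that fails this test can be fixed to either constant without changing the parent's predicate, which is the intuition behind removing such nodes during minimization.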

A transitive structural predicate $f_{tr}(u)$ for a node $u$ is
constructed from $f_{ext}(u)$ in a bottom-up sweep as follows. (1) For
each leaf node and each non-independently constraint node $u$, the
transitive structural predicate is the same as the extended structural
predicate, i.e. $f_{tr}(u) = f_{ext}(u)$. (2) For an internal node $u$
such that the transitive structural predicates of all children have
been defined, $f_{tr}(u)$ is produced by substituting
$(p_{u'} \wedge f_{tr}(u'))$ for each variable $p_{u'}$ of an
independently constraint node $u'$ in $f_{ext}(u)$.
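The bottom-up substitution can be sketched with formulas encoded as nested tuples. Everything below is an assumption made for illustration: the encoding, the function names, the small query fragment, and the shortcut of skipping children whose transitive predicate is the constant true (since $p \wedge 1 \equiv p$).

```python
TRUE = ("true",)


def substitute(formula, var, replacement):
    """Replace every occurrence of the variable `var` by `replacement`."""
    kind = formula[0]
    if kind == "var":
        return replacement if formula[1] == var else formula
    if kind == "true":
        return formula
    return (kind,) + tuple(substitute(sub, var, replacement)
                           for sub in formula[1:])


def transitive(u, f_ext, children, independent):
    """f_tr(u): substitute (p_c AND f_tr(c)) for each independent child c."""
    result = f_ext.get(u, TRUE)
    if not children.get(u) or u not in independent:
        return result  # leaf or non-independently constraint node
    for c in children[u]:
        if c in independent:
            sub = transitive(c, f_ext, children, independent)
            if sub != TRUE:
                result = substitute(result, c, ("and", ("var", c), sub))
    return result


# Hypothetical fragment: u3 has children u6, u7, u8, and u7 has children
# u9 and u10; ("var", "u7") stands for the variable p_{u7}.
f_ext = {
    "u3": ("or", ("not", ("var", "u6")),
                 ("and", ("var", "u7"), ("var", "u8"))),
    "u7": ("or", ("var", "u9"), ("var", "u10")),
}
children = {"u3": ["u6", "u7", "u8"], "u7": ["u9", "u10"]}
independent = {"u3", "u6", "u7", "u8", "u9", "u10"}

print(transitive("u3", f_ext, children, independent))
```

Here the variable of $u_7$ is replaced by $p_{u_7} \wedge (p_{u_9} \vee p_{u_{10}})$, so the predicate of $u_3$ now carries the constraints of its whole subtree.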

For two non-root nodes $u_1, u_2$ in $Q$, we say that $u_2$ is similar
to $u_1$, denoted by $u_1 \lesssim u_2$, if the following conditions
hold. (1) For each formula "$A$ op $a_1$" in $f_a(u_1)$, there is a
formula "$A$ op $a_2$" in $f_a(u_2)$ such that (a) if op
$\in \{<, \le\}$, $a_2 \le a_1$, (b) if op $\in \{>, \ge\}$,
$a_2 \ge a_1$, (c) if op $\in \{=, \ne\}$, $a_1 = a_2$. We use
$u_2 \vdash u_1$ to denote that $u_1$ and $u_2$ satisfy this condition.
(2) For each PC (resp. AD) child $u'_1$ of $u_1$ such that $u'_1$ is an
independently constraint node, there is a PC child (resp. a descendant)
$u'_2$ of $u_2$ such that $u'_1 \lesssim u'_2$. (3) The formula
$f_{tr}(u_2) \rightarrow f_{tr}(u_1)[u_1 \mapsto u_2]$ is a tautology,
where $f_{tr}(u_1)[u_1 \mapsto u_2]$ is a formula transformed from
$f_{tr}(u_1)$ by replacing $p_{u'}$ with $p_{u''}$ for each pair
$(u', u'')$ such that (a) $u'$ is a descendant of $u_1$, (b) $u''$ is a
descendant of $u_2$ and (c) $u' \lesssim u''$. We say that $u_1$ is
subsumed by $u_2$, denoted by $u_1 \unlhd u_2$, if (1)
$u_1 \lesssim u_2$, and (2) the parent of $u_1$ is the lowest common
ancestor $u_{lca}$ of $u_1$ and $u_2$, and (a) if $u_1$ is a PC child
of $u_{lca}$, $u_2$ is also a PC child of $u_{lca}$; (b) otherwise
$u_2$ is a descendant of $u_{lca}$.
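Condition (1) of similarity is a purely syntactic comparison of attribute predicates and can be sketched directly. The encoding of atoms as `(attribute, operator, constant)` triples and the function name are assumptions made for this illustration; the bounds are read as non-strict, i.e. the atom of $u_2$ must be at least as restrictive as the atom of $u_1$.

```python
def attrs_entail(fa_u2, fa_u1):
    """Condition (1): each atom 'A op a1' of u1 has a matching atom
    'A op a2' of u2 with the same operator that is at least as restrictive."""

    def covers(atom1, atom2):
        (attr1, op1, a1), (attr2, op2, a2) = atom1, atom2
        if attr1 != attr2 or op1 != op2:
            return False
        if op1 in ("<", "<="):
            return a2 <= a1   # a tighter or equal upper bound
        if op1 in (">", ">="):
            return a2 >= a1   # a tighter or equal lower bound
        return a1 == a2       # '=' and '!=' need the same constant

    return all(any(covers(x, y) for y in fa_u2) for x in fa_u1)


loose = [("year", ">=", 2000)]
tight = [("year", ">=", 2005), ("tag", "=", "inproceedings")]

print(attrs_entail(tight, loose))  # True: every node passing tight passes loose
print(attrs_entail(loose, tight))  # False
```

The test is sufficient rather than complete: it compares atoms with the same operator only, as the definition does, and does not reason across operators.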

We finally define complete structural predicates to characterize
the whole structural constraints of a GTPQ. For a node $u$, the
complete structural predicate $f_{cs}(u)$ is created from the
corresponding transitive structural predicate $f_{tr}(u)$ by performing
the following operations: (1) for each descendant $u'$ of $u$, if its
attribute predicate is unsatisfiable,
$f_{cs}^{new}(u) = f_{cs}^{old}(u)[p_{u'}/0]$, where $f_{cs}^{old}(u)$
is the old formula before this transformation and $f_{cs}^{new}(u)$ is
the newly generated formula; (2) for every two nodes $u_1$ and $u_2$ in
two distinct subtrees of $u$ such that $u_2 \unlhd u_1$,
$f_{cs}^{new}(u) = f_{cs}^{old}(u) \wedge (\neg p_{u_1} \vee (p_{u_2}
\wedge f_{ext}(u_2)))$, where $f_{cs}^{old}(u)$ and $f_{cs}^{new}(u)$
have the same meaning as above in (1).

Theorem 1 shows that the satisfiability of a GTPQ is equivalent
to the satisfiability of the complete structural predicate of the root,
provided that the attribute predicate of the root is satisfiable. If
the query is a conjunctive or union-conjunctive GTPQ, the problem of
satisfiability can be solved in linear time. When negation is added
into the query, satisfiability becomes NP-complete.

Theorem 1. A GTPQ $Q$ is satisfiable if and only if for the root
node $u$ of $Q$, $f_a(u)$ and $f_{cs}(u)$ are both satisfiable. □

Theorem 2. 

1. The satisfiability of a union-conjunctive GTPQ can be deter- 
mined in linear time. 

2. The satisfiability of a GTPQ is NP-complete. □ 
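One way to build intuition for the gap between the two cases, offered here as an illustration rather than as the paper's proof: a formula built only from variables, constants, conjunction and disjunction is monotone, so it is satisfiable exactly when it is true under the all-true assignment, which takes a single linear pass. With negation this shortcut no longer applies, and the sketch falls back to trying every assignment. The tuple encoding and function names are hypothetical.

```python
from itertools import product


def evaluate(formula, assignment):
    kind = formula[0]
    if kind == "const":
        return formula[1]
    if kind == "var":
        return assignment[formula[1]]
    if kind == "not":
        return not evaluate(formula[1], assignment)
    parts = [evaluate(sub, assignment) for sub in formula[1:]]
    return all(parts) if kind == "and" else any(parts)


def variables(formula):
    if formula[0] == "const":
        return set()
    if formula[0] == "var":
        return {formula[1]}
    return set().union(*(variables(sub) for sub in formula[1:]))


def sat_negation_free(formula):
    """Monotone case: one evaluation under the all-true assignment suffices."""
    return evaluate(formula, {v: True for v in variables(formula)})


def sat_brute_force(formula):
    """General case: exponential search over all assignments."""
    names = sorted(variables(formula))
    return any(evaluate(formula, dict(zip(names, bits)))
               for bits in product((False, True), repeat=len(names)))


# A constant false models a variable fixed to 0, e.g. for a node whose
# attribute predicate is unsatisfiable.
monotone = ("and", ("var", "p2"), ("or", ("var", "p3"), ("const", False)))
dead = ("and", ("var", "p2"), ("const", False))
contradiction = ("and", ("var", "p"), ("not", ("var", "p")))

print(sat_negation_free(monotone), sat_negation_free(dead))
print(sat_brute_force(contradiction))
```

The all-true shortcut would wrongly report the third formula as unsatisfiable only by accident of its shape; in general it gives no guarantee once negation appears, which is why the second function is needed.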

Example 4. Consider the query in Fig. 2(b). All query nodes are
independently constraint nodes. Replacing $p_{u_7}$ with
$p_{u_7} \wedge (p_{u_9} \vee p_{u_{10}})$ in $f_{ext}(u_3)$, we have
$f_{tr}(u_3) = \neg p_{u_6} \vee (p_{u_7} \wedge (p_{u_9} \vee
p_{u_{10}}) \wedge p_{u_8})$. Since there are no two nodes $u$ and $u'$
such that





Algorithm 1: minGTPQ 



Figure 4: Examples for four fundamental problems of GTPQs 

U <U', fcs{ui) = ftr{ui) =Pur, Apui App5 Ap„3 A (-'Pug V 

P"io) Apug))- Due to the satisfiability of fcs{ui), 
we see that the query is satisfiable. Indeed, we can get a nonempty
answer by posing $Q$ on $G$ in Fig. 2(b), as shown in Example 3.

Let us turn to $Q_1$ and $Q_2$ depicted in Fig. 4. The following table
presents the structural predicates of internal nodes for $Q_1$ and
$Q_2$.

$f_s(u_1) = \neg p_{u_2}$
$f_s(u_2) = p_{u_4}$
$f_s(u_5) = p_{u_8}$
$f_s(u_3) = (p_{u_5} \wedge p_{u_6}) \vee (\neg p_{u_5} \wedge p_{u_6})$
$f_s(u_6) = p_{u_7}$

For both queries, $u_5$ and $u_8$ are two non-independently
constraint nodes. In $Q_1$, we have $u_2 \unlhd u_6$, because (1)
$u_6 \vdash u_2$, (2) $u_4 \lesssim u_7$, (3) $f_{tr}(u_6) \rightarrow
f_{tr}(u_2)[u_2 \mapsto u_6] = p_{u_7} \rightarrow p_{u_7}$, which is a
tautology, (4) $u_2$ is an AD child of $u_1$, which is an ancestor of
$u_6$. In contrast, for $Q_2$, $u_2 \not\unlhd u_6$, since now $u_2$ is
a PC child of $u_1$ but $u_6$ is not. Suppose attribute predicates of
all nodes are satisfiable. Then for $Q_2$, $f_{cs}(u_1) =
\neg(p_{u_2} \wedge p_{u_4}) \wedge p_{u_3} \wedge ((p_{u_5} \wedge
p_{u_6} \wedge p_{u_7}) \vee (\neg p_{u_5} \wedge p_{u_6} \wedge
p_{u_7}))$, which is satisfiable; but for $Q_1$, $f'_{cs}(u_1) =
f_{cs}(u_1) \wedge (p_{u_6} \rightarrow (p_{u_2} \wedge p_{u_4}))$,
which is unsatisfiable. Therefore, we know that $Q_2$ is satisfiable
and $Q_1$ is not. □

3.2 Containment and Equivalence 

For two GTPQs $Q_1$ and $Q_2$, $Q_1$ is contained in $Q_2$, denoted by
$Q_1 \sqsubseteq Q_2$, if for any data graph $G$,
$Q_1(G) \subseteq Q_2(G)$. $Q_1$ and $Q_2$ are equivalent, denoted by
$Q_1 \equiv Q_2$, if $Q_1(G) \subseteq Q_2(G)$ and
$Q_2(G) \subseteq Q_1(G)$.

Homomorphism. Given two GTPQs $Q_1$ with query nodes $V_q^1$ and
$Q_2$ with query nodes $V_q^2$, a homomorphism from $Q_1$ to $Q_2$ is a
mapping $\lambda$ from $V_q^1$ to $V_q^2 \cup \{\bot\}$ such that (1)
the two sets of output nodes of $Q_1$ and $Q_2$ are bijective; (2) for
any non-independently constraint node $u \in V_q^1$,
$\lambda(u) = \bot$; (3) for any independently constraint node $u_1$ in
$V_q^1$, (a) for any PC (resp. AD) child node $u'_1$ of $u_1$ such that
$u'_1$ is also an independently constraint node, $\lambda(u'_1)$ is a
PC child (resp. a descendant) of $\lambda(u_1)$, and (b)
$\lambda(u_1) \vdash u_1$; (4) the formula $f_{cs}(u^2_{root})
\rightarrow f_{cs}(u^1_{root})[u^1_{root} \mapsto \lambda(u^1_{root})]$
is a tautology, where $u^i_{root}$ is the root node of $Q_i$ and
$f_{cs}(u^1_{root})[u^1_{root} \mapsto \lambda(u^1_{root})]$ is a
formula transformed from $f_{cs}(u^1_{root})$ by replacing $p_{u'}$
with $p_{\lambda(u')}$ for each independently constraint node
$u' \in V_q^1$.

Theorem 3 yields a decision procedure for containment and
equivalence between two GTPQs. Theorem 4 states the intractability of
the two problems of containment and equivalence.

Theorem 3. For two GTPQs $Q_1$ and $Q_2$, $Q_1 \sqsubseteq Q_2$ iff
there exists a homomorphism from $Q_2$ to $Q_1$. □

Theorem 4. The containment checking for GTPQs is co-NP-hard. □

Example 5. Recall the queries in Fig. 4. We now assume
$f_s(u_1) = p_{u_2}$ and the other predicates are the same as in
Example 4. Let $Q_3$ be a conjunctive GTPQ, and let $u_i^j$ denote
$u_i$ in $Q_j$ to distinguish nodes in different queries. We have that
$Q_2 \sqsubseteq Q_1$, $Q_2 \sqsubseteq Q_3$ and $Q_1 \equiv Q_3$.
Indeed, there is a homomorphism $\lambda_{3,2}$ from $Q_3$ to $Q_2$,
where



Input: A GTPQ $Q$.
Output: A minimum equivalent GTPQ $Q_m$ of $Q$.

1. construct an equivalent query $Q_m$ from $Q$ by removing subtrees
   rooted at a node whose attribute predicate is unsatisfiable and
   assigning the variables of the removed nodes to 0 for respective
   structural predicates
2. check each structural predicate to determine for each node whether
   it is an independently constraint node and remove all
   non-independently constraint nodes followed by assigning the
   variables of them to 0 for respective structural predicates
3. compute the complete structural predicate $f_{cs}(u)$ for each node
   $u$ in $Q_m$ in bottom-up order
4. for each $u \in V_q^m$ in bottom-up order do
5.     if $f_{cs}(u)$ is unsatisfiable then
6.         $f_s(parent(u)) := f_s(parent(u))[p_u/0]$
7.         remove the whole subtree rooted at $u$ from $Q_m$
8. for each node $u \in V_q^m$ do
9.     if the formula $f_{cs}(u_r) \rightarrow p_u$ is a tautology then
10.        for each $u'$ such that $u' \unlhd u$ do
11.            $f_s(parent(u')) := f_s(parent(u'))[p_{u'}/1]$
12.            for each output node $u_o$ in the subtree rooted at $u'$ do
13.                if there exists $u''$ such that $u_o \unlhd u''$ and the
                   subtree query pattern rooted at $u''$ and that rooted
                   at $u_o$ are isomorphic then
14.                    remove $u_o$ from the set of output nodes and
                       add $u''$ into it
15.            remove nodes in the subtree rooted at $u'$ from $Q_m$
               that are not ancestors of any output nodes and
               corresponding edges they connect
16.    else if the formula $f_{cs}(u_r) \rightarrow \neg p_u$ is a tautology then
17.        for each pair $(u, u') \in S$ do
18.            $f_s(parent(u')) := f_s(parent(u'))[p_{u'}/0]$
19.            remove the whole subtree rooted at $u'$ from $Q_m$
20. return $Q_m$



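$\lambda_{3,2}(u_1^3) = u_1^2$, $\lambda_{3,2}(u_2^3) = u_3^2$,
$\lambda_{3,2}(u_3^3) = u_6^2$, $\lambda_{3,2}(u_4^3) = u_7^2$.
There is also $\lambda_{1,3}$ from $Q_1$ to $Q_3$, in which
$\lambda_{1,3}(u_i^1) = \bot$ ($i = 5, 8$), $\lambda_{1,3}(u_j^1) =
u_3^3$ ($j = 2, 6$), $\lambda_{1,3}(u_k^1) = u_4^3$ ($k = 4, 7$),
$\lambda_{1,3}(u_1^1) = u_1^3$, $\lambda_{1,3}(u_3^1) = u_2^3$. We can
also derive $\lambda_{3,1}$ and $\lambda_{1,2}$. □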

3.3 Minimization 

Since the efficiency of processing a query depends on its size, it is
necessary to identify and eliminate redundant nodes. For a GTPQ with query
nodes $V_q$, we define its size as $|Q| = |V_q|$.

Minimization. Given a GTPQ $Q$, the minimization problem is to find another
GTPQ $Q_m$ such that (1) $Q \equiv Q_m$, (2) $|Q_m| \le |Q|$, and (3) there
exists no other such $Q'$ with $|Q'| < |Q_m|$.

From Theorem 3, we have that for a GTPQ $Q$ there is a minimal equivalent
GTPQ of $Q$ whose query nodes are a subset of the query nodes of $Q$. We say
two GTPQs $Q_1$ and $Q_2$ are isomorphic if there is a homomorphism between
them that is a one-to-one mapping. The following proposition shows that the
minimal equivalent query of a GTPQ is unique up to isomorphism.

Proposition 5. Let GTPQs $Q_1$ and $Q_2$ be minimal and equivalent.
Then $Q_1$ and $Q_2$ are isomorphic. □

Algorithm 1 shows how to minimize a GTPQ. We give an example to
illustrate it.

Example 6. In Fig. 4, the query $Q_3$ is a minimum equivalent query of $Q_1$
with the structural predicates given in Example 5. (1) Since we suppose all
attribute predicates are satisfiable, no nodes are removed in this step, and
$Q_m = Q_1$ (line 1). (2) All nodes except $u_5$ and $u_8$ are independently
constraint nodes, hence we remove $u_5$ and $u_8$ and assign 0 to their
variables in the respective structural predicates, thereby having that
$f_s(u_3) = p_{u_9}$ (line 2). In this step, all propositional formulas of
structural predicates are simplified to equivalent formulas with the minimum
number of variables. (3) There are no nodes whose complete structural
predicates are unsatisfiable, so none is removed (lines 4-7). (4) The formula
$f_{cs}(u_1) \to p_{u_6}$ is a tautology and $u_2 \lhd u_6$, so $u_2$ and its
child $u_4$ are removed, and we have $f_s(u_1) = 1$, thereby generating the
query $Q_3$ (lines 8-19). This step removes subtrees that are semantically
subsumed by others. □
The correctness can be proved based on Theorem 3. Since the algorithm
involves solving SAT problems, the worst-case time complexity is exponential
in the query size. In fact, Theorem 6 shows that the minimization problem is
NP-hard, and hence a polynomial-time algorithm is unlikely to exist.
Nevertheless, because there are many high-performance algorithms for SAT and
the query size is usually small in practice, it is still worth minimizing a
GTPQ considering the resulting gains in evaluation efficiency.
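
The tautology tests in lines 9 and 16 of Algorithm 1 can be illustrated with a small sketch. It is not the paper's implementation: it assumes that a structural predicate is given as a Python callable over a truth assignment, and it replaces the SAT solver with brute-force enumeration, which is only reasonable for small queries. The predicate `f_cs_root` and the variable names are hypothetical.

```python
from itertools import product

def is_tautology(formula, variables):
    """Return True if `formula` holds under every truth assignment."""
    for values in product([False, True], repeat=len(variables)):
        if not formula(dict(zip(variables, values))):
            return False
    return True

def forces(f_cs_root, var, variables, value=True):
    """Test whether f_cs_root -> var (value=True) or
    f_cs_root -> NOT var (value=False) is a tautology."""
    return is_tautology(
        lambda a: (not f_cs_root(a)) or (a[var] == value), variables)

# Hypothetical complete structural predicate of the root:
#   p_u6 AND (p_u2 OR p_u3) AND NOT p_u5
VARS = ["p_u2", "p_u3", "p_u5", "p_u6"]

def f_cs_root(a):
    return a["p_u6"] and (a["p_u2"] or a["p_u3"]) and not a["p_u5"]

# Variables that every satisfying assignment sets to true (line 9 case)
always_true = [v for v in VARS if forces(f_cs_root, v, VARS, True)]
# Variables that every satisfying assignment sets to false (line 16 case)
always_false = [v for v in VARS if forces(f_cs_root, v, VARS, False)]
```

Enumeration costs $2^n$ evaluations for $n$ variables, which matches the exponential worst case noted above; a SAT solver would be the natural replacement for larger predicates.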

Theorem 6. The minimization problem for GTPQs is NP-hard. □ 

4. EVALUATING GENERALIZED TREE 
PATTERN QUERIES 

4.1 Framework 

Recall that two major problems that impair the efficiency of al- 
gorithms for processing TPQs over graphs are large intermediate 
results and expensive join operations on them. In the following, we 
propose two new techniques to address them. 

Graph representation of intermediate results. To reduce the cost 
of storing intermediate results and avoid merge-join operations, we 
represent intermediate results as a graph rather than sets of tuples. 
Each match for a path or a substructure of the query pattern can 
be embedded into the tree pattern and hence naturally can be rep- 
resented as a tree. By grouping all the candidate matches by the 
corresponding matched query nodes and adding an edge to connect 
a pair of data nodes whenever there's an edge between the corre- 
sponding pair of query nodes in the query pattern, we can represent 
the intermediate and final results as graphs. In such a graph rep- 
resentation, each data node exists at most once, in contrast to the 
tuple representation in which a data node may be in multiple tu- 
ples. Also, the AD or PC relationship between two nodes is exactly 
represented by only one edge, while in the tuple form the corre- 
sponding two nodes may be put as an element in more than one 
tuple to repeatedly and explicitly represent their relationship. Since 
the size of the intermediate matches may be huge, even exponen- 
tial in both the query size and the data size in the worst case, the 
graph representation is much more compact with at most quadratic 
space cost. Moreover, to enumerate all resulting matches of a pat- 
tern query, we only need to perform one single graph traversal on a 
presumably small graph instead of multiple merge-join operations 
over large intermediate results. 

It is worth noting that this way of representing intermediate results can
also be applied to algorithms for other graph pattern queries to speed up
their evaluation. For TPQs, it is particularly suitable because we can
enumerate matches directly from the graph. For general graph pattern queries,
however, additional matching operations including joins may be unavoidable,
because it is difficult to locally determine which nodes should be traversed
to form a match. Since these additional matching operations are in essence an
easier evaluation of a pattern matching on a smaller graph, such a technique
can still be expected to speed up the whole processing.



Reachability index enhanced effective pruning. Since the number of data nodes
to be processed significantly affects the efficiency of pattern query
evaluation, it is desirable to perform effective pruning to reduce the number
of candidate matching nodes. In the literature, [6] and [11] have developed
two pruning approaches for reachability query pattern matching. TwigStackD
[6] proposed a pre-filtering approach that can select nodes guaranteed to be
in final matches. Since it has to perform two graph traversals on the data
graph, it is likely infeasible for large-scale real-world graphs. The work
[11] on pattern queries over labeled graphs proposed another pruning process,
namely R-semijoin, using a special index called the cluster-based R-join
index. It can filter nodes that cannot possibly contribute to partial matches
for an AD edge between two labeled query nodes. However, (1) the selected
nodes may still be redundant, since the nodes only satisfy the reachability
condition imposed by one edge and the global structural satisfaction is not
checked. (2) It is highly costly to construct and store the R-join index for
a large data graph, since the index essentially precomputes and stores all
matches for pairwise labels and the index size is quadratic in the graph
size. (3) It cannot be used to perform pruning for queries that have
expressive attribute predicates rather than a fixed set of labels associated
with nodes. Since predicates of query nodes are often neither fixed nor
predictable, the index cannot be precomputed and this approach cannot be
used.

We explore the potential of existing reachability indexes for effective
pruning. It is interesting to note that most reachability indexing schemes
follow a common paradigm. They first utilize a relatively simple reachability
index, which often assigns two or three labels to each node, in order to
cover the reachability of a substructure, called a cover, such as the
tree-cover in [1, 26], the path-tree in [15], and the chain-cover in [8, 19].
To cover the remaining reachability information, each node keeps one or two
lists where all or just a portion of its ancestors and descendants are
stored. When answering whether a node can reach another, the algorithms
typically use nodes stored in the lists as intermediates to determine the
reachability.

When it comes to answering a number of reachability queries between two sets
of nodes, the following two observations are helpful: (1) the lists of
different nodes often share a number of nodes, and (2) the nodes in different
lists have rich reachability information. If we merge the lists of a set of
nodes by eliminating the duplicates and those whose reachability information
can be derived from others, the merged list "subsumes" all the reachability
information in the original lists of the node set, but its size will not be
much larger, and possibly even much smaller, than the list size of any
individual node. Using the merged list, reachability patterns are likely to
be evaluated more efficiently.

For example, considering a reachability pattern $u_A \to u_B$, we want to
filter data nodes in $mat(u_A)$ that cannot reach any nodes in $mat(u_B)$.
Instead of performing $|mat(u_A)| \times |mat(u_B)|$ pairwise reachability
queries to check for each node $v \in mat(u_A)$ whether it can reach a node
$v' \in mat(u_B)$, (1) we merge all index lists of $mat(u_B)$ into a single
list of the minimum size that preserves all the reachability information
saved in the original lists; and (2) for each $v \in mat(u_A)$, we use the
list of $v$ and the merged list, rather than the individual lists for
$mat(u_B)$, to holistically determine whether $v$ reaches some node in
$mat(u_B)$. Intuitively, we can regard the set $mat(u_B)$ as a single dummy
node which is reachable from all nodes that are ancestors of nodes in
$mat(u_B)$.

In this paper, we use 3-hop [19] as the underlying reachability index scheme,
as 3-hop has both a very compact index size and reasonable query processing
time. As different labeling schemes are often preferable for different graph
structures, our framework can also flexibly use other labeling schemes to
efficiently process different types of graphs.

Figure 5: Chain decomposition and 3-hop index

We restrict our attention to in-memory processing and do not ad- 
dress the issues relating to disk-based access methods and physical 
representation of graph data. 

Algorithm outline. Our GTPQ evaluation algorithm (referred to as GTEA) is
outlined as follows. First, it prunes candidate matching nodes that do not
satisfy downward structural constraints (i.e., that do not satisfy the
subtree pattern query rooted at the corresponding query node). Second, it
performs the second-round pruning process on a carefully selected subtree
pattern, called the prime subtree, to remove nodes not satisfying upward
structural constraints (i.e., not reachable from any candidate nodes of the
root). Third, the prime subtree is further shrunk if possible, and GTEA
generates the matches of the shrunk prime subtree while representing the
intermediate results as a graph, from which the final results can be
efficiently obtained. We begin by focusing on evaluating GTPQs with AD edges
only and show how to extend the algorithm to process PC edges in Section 4.4.

4.2 Pruning Candidate Matching Nodes 

We use a two-round pruning process to filter unqualified data 
nodes. The first round selects data nodes that satisfy downward 
structural constraints of the query pattern for each query node. At 
the second round, we then obtain a minimum subtree that contains 
all output nodes having more than one candidate matching node, 
and select necessary edges from this subtree to find nodes satisfying 
upward structural constraints. 

4.2.1 Preliminary: Merging 3-hop index 

3-hop is a recent graph reachability indexing scheme well known for its
compact index size and reasonable query time. It follows the indexing
paradigm mentioned in Section 4.1. It uses a chain-cover, which consists of a
set of disjoint chains covering all nodes in the graph. Each node in the
graph is assigned a chain ID $cid$ and its sequence number $sid$ on its
chain. For two nodes $v$ and $v'$ on the same chain (i.e., $v.cid = v'.cid$),
$v \le_c v'$ if $v.sid \le v'.sid$. In particular, if $v.sid < v'.sid$, we
say $v$ is smaller than $v'$. Obviously, reachability on the chain-cover can
be answered using chain IDs and sequence numbers. To encode the remaining
reachability information outside the chain-cover, 3-hop records a successor
list $L_{out}(v)$ (resp. predecessor list $L_{in}(v)$) of "entry" (resp.
"exit") nodes to (resp. from) other chains for each node $v$. The entry
(resp. exit) node to (resp. from) a chain is the smallest (resp. largest) one
on that chain that $v$ reaches (resp. that reaches $v$). See [19] for details
of 3-hop index construction. For answering the reachability between two nodes
$v_1$ and $v_2$ on different chains, 3-hop takes the following steps. (1)
Collect the smallest nodes on any other chain that $v_1$ can reach through
exit nodes of chain $v_1.cid$. That is, we get a set of nodes
$X_{v_1} = \{x \mid x \in L_{out}(v') \text{ and } \forall v' \ge_c v_1,\ x \le_c L_{out}^{x.cid}(v')\} \cup \{v_1\}$,
where $L_{out}^{x.cid}(v')$ is the entry node of $v'$ on chain $x.cid$. We
call $X_{v_1}$ the complete successor list of $v_1$. (2)



Procedure 2: MergePredLists

Input: A set of nodes $S$.
Output: The predecessor contour $C^p$ of $S$.

1.  for each node $v \in S$ do
2.      if $C^p[v.cid] < v.sid$ then $C^p[v.cid] := v.sid$
3.      $v' := v$
4.      repeat
5.          for each index node $v'' \in L_{in}(v')$ do
6.              if $C^p[v''.cid] < v''.sid$ then
7.                  $C^p[v''.cid] := v''.sid$
8.          $v' := prev(v')$
9.      until $v' = null$ or $visited_{v'.cid} \ge v'.sid$
10.     if $visited_{v.cid} < v.sid$ then $visited_{v.cid} := v.sid$
11. return $C^p$



Collect the largest nodes on any chain that can reach $v_2$ through entry
nodes of chain $v_2.cid$. In this step, we get a set of nodes
$Y_{v_2} = \{y \mid y \in L_{in}(v') \text{ and } \forall v' \le_c v_2,\ L_{in}^{y.cid}(v') \le_c y\} \cup \{v_2\}$,
where $L_{in}^{y.cid}(v')$ is the exit node of $v'$ on chain $y.cid$. We call
$Y_{v_2}$ the complete predecessor list of $v_2$. (3) If there is a pair
$(x, y)$ ($x \in X_{v_1}$, $y \in Y_{v_2}$) such that $x \le_c y$, then we
can conclude that $v_1$ can reach $v_2$.
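
The three query steps can be sketched in Python as follows. This is a simplified illustration rather than the 3-hop implementation of [19]: the index is assumed to be given as plain dictionaries, the three-chain toy index is hypothetical and built by hand, complete lists are stored as maps from chain ID to sequence number, and the tracing pointers that skip nodes with empty lists are omitted.

```python
# Hypothetical toy index with three chains; sid is the 1-based position.
CHAINS = {1: ["a1", "a2", "a3"], 2: ["b1", "b2", "b3"], 3: ["c1", "c2"]}
CID = {v: c for c, chain in CHAINS.items() for v in chain}
SID = {v: i + 1 for chain in CHAINS.values() for i, v in enumerate(chain)}
L_OUT = {"a2": ["b2"]}  # edge a2 -> b2: b2 is the entry node of a2 on chain 2
L_IN = {"c1": ["b3"]}   # edge b3 -> c1: b3 is the exit node of c1 on chain 2

def complete_successors(v):
    """X_v as a map: chain id -> smallest sid on that chain reachable from v."""
    x = {CID[v]: SID[v]}
    for w in CHAINS[CID[v]][SID[v] - 1:]:       # v and all larger nodes
        for e in L_OUT.get(w, []):
            x[CID[e]] = min(x.get(CID[e], SID[e]), SID[e])
    return x

def complete_predecessors(v):
    """Y_v as a map: chain id -> largest sid on that chain that reaches v."""
    y = {CID[v]: SID[v]}
    for w in CHAINS[CID[v]][:SID[v]]:           # v and all smaller nodes
        for e in L_IN.get(w, []):
            y[CID[e]] = max(y.get(CID[e], SID[e]), SID[e])
    return y

def reaches(v1, v2):
    """Step (3): look for x in X_v1 and y in Y_v2 on one chain with x <=_c y."""
    x, y = complete_successors(v1), complete_predecessors(v2)
    return any(c in y and x[c] <= y[c] for c in x)
```

In the toy index, `reaches("a1", "c2")` follows the three hops a1 to a2 along chain 1, a2 to b2 through $L_{out}$, b2 to b3 along chain 2, and b3 to c1 through $L_{in}$. Note that this sketch treats every node as reaching itself.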

Example 7. Fig. 5 gives a chain decomposition of $G$ of Fig. 2(a) and the
corresponding 3-hop index. Chain IDs and sequence numbers are omitted. As an
example, $v_3.cid = v_{11}.cid = 1$, $v_{11}.sid = 4$ and $v_3.sid = 2$.
Because $v_3.sid < v_{11}.sid$, $v_3 \le_c v_{11}$ and $v_{11}$ is reachable
from $v_3$. To answer whether $v_3$ can reach $v_9$, we collect the entry
nodes in $L_{out}(v_i)$ ($i = 3, 7, 11, 16$) into $X_{v_3} = \{v_3, v_4\}$.
Then we look up the exit nodes in $L_{in}(v_j)$ ($j = 9, 5$) and get
$Y_{v_9} = \{v_9, v_{12}\}$. Since there is a pair $(v_4, v_{12})$ such that
$v_4 \in X_{v_3}$, $v_{12} \in Y_{v_9}$, and $v_4 \le_c v_{12}$, we say $v_3$
can reach $v_9$. □

Note that to obtain the complete predecessor (resp. successor) lists, the
original 3-hop needs to visit all smaller (resp. larger) nodes on the chain.
We can assign a forward (and backward) tracing pointer to each node, which
points to the smallest larger (resp. largest smaller) node whose $L_{out}$
(resp. $L_{in}$) list is nonempty, so that nodes with empty lists can be
skipped. We define two operations $next(v)$ and $prev(v)$ on each node $v$,
which return the node that the forward and the backward tracing pointer
points to, respectively. For example, since $v_6$ is the largest smaller node
that has a non-empty $L_{in}$ w.r.t. $v_{15}$, $prev(v_{15}) = v_6$.

A basic operation of the pruning process is merging the complete
predecessor/successor lists for a given set of data nodes (denoted by $S$).
For the 3-hop case, it picks the largest (resp. smallest) nodes on each chain
from the complete predecessor (resp. successor) lists, and we call the
resultant list the predecessor contour $C^p$ (resp. successor contour $C^s$).
A node $v$ is said to reach (resp. be reachable from) $S$ if $v$ reaches
(resp. is reachable from) at least one node in $S$. We have the following
proposition.

Proposition 7. A data node $v$ reaches $mat(u)$ iff there is a pair
$(x, y) \in X_v \times C^p_u$ such that $x \le_c y$, while $mat(u)$ reaches
$v$ iff there exists a pair $(x, y) \in C^s_u \times Y_v$ such that
$x \le_c y$. □

Procedure 2 sketches the process of calculating the predecessor contour
$C^p$, where $visited_i$ records the largest node on chain $i$ whose
predecessor list has been looked up. For each node $v \in S$, MergePredLists
processes $v$ and those smaller nodes whose predecessor lists have not been
looked up as follows. For each node $v'$ to be processed and each exit node
$v''$ in $L_{in}(v')$, it compares $v''$ with the node in $C^p$ on the same
chain as $v''$, and updates $C^p$ if $v''$ is larger (lines 4-9). To retrieve
nodes from $C^p$ efficiently, $C^p$ can be implemented as a map that uses
chain IDs as keys and sequence numbers as values.
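
A minimal Python sketch of this merging step is given below, using the map representation just described. It is an illustration under stated assumptions, not the paper's code: the index is a hand-built, hypothetical toy given as dictionaries, the backward walk visits every smaller node instead of following $prev(\cdot)$ pointers, and the `lookups` list exists only to make visible that no predecessor list is scanned twice.

```python
# Hypothetical toy index; sid is the 1-based position on the chain.
CHAINS = {1: ["a1", "a2"], 2: ["b1", "b2", "b3", "b4"], 3: ["c1", "c2"]}
CID = {v: c for c, chain in CHAINS.items() for v in chain}
SID = {v: i + 1 for chain in CHAINS.values() for i, v in enumerate(chain)}
L_IN = {"b2": ["a1"], "c1": ["b3"]}   # exit nodes: a1 -> b2 and b3 -> c1

lookups = []  # nodes whose predecessor list was scanned, in scan order

def merge_pred_lists(nodes):
    """Return the predecessor contour as a map: chain id -> largest sid."""
    contour = {}   # C^p
    visited = {}   # chain id -> largest sid whose L_in has been scanned
    for v in nodes:
        c = CID[v]
        contour[c] = max(contour.get(c, 0), SID[v])
        s = SID[v]
        # Walk backwards on v's chain until reaching an already scanned node.
        while s >= 1 and s > visited.get(c, 0):
            w = CHAINS[c][s - 1]
            lookups.append(w)
            for e in L_IN.get(w, []):
                contour[CID[e]] = max(contour.get(CID[e], 0), SID[e])
            s -= 1
        visited[c] = max(visited.get(c, 0), SID[v])
    return contour

contour = merge_pred_lists(["b3", "b4", "c2"])
```

Here `b4` only scans its own list because the lists of `b3`, `b2` and `b1` were already scanned for `b3`, and the resulting contour holds one entry per chain.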



Procedure 3: PruneDownward

Input: 3-hop index ($L_{in}$, $L_{out}$), a GTPQ $Q$.
Output: Updated candidate matching nodes satisfying downward
        structural constraints.

1.  for each node $u \in V_q$ do $mat(u) := \{x \mid x \in V, x \sim u\}$
2.  for each leaf node $u'$ in $V_q$ do $C^p_{u'} := MergePredLists(mat(u'))$
3.  $V'_q := V_q \setminus \{u' \mid u' \text{ is a leaf node}\}$
4.  for each $u \in V'_q$ in bottom-up order do
5.      for each $v \in mat(u)$ do $chain_{v.cid} := chain_{v.cid} \cup \{v\}$
6.      for each $chain_i$ that is not empty do
7.          for each child $u'$ of $u$ do $val[p_{u'}] := 0$
8.          for each node $v_i \in chain_i$ do
9.              for each child $u'$ of $u$ s.t. $val[p_{u'}] = 0$ do
10.                 if $v_i$ reaches $mat(u')$ then  // using Proposition 7
11.                     $val[p_{u'}] := 1$
12.             if $f_s(u)$ evaluates to false with the valuation $val$ then
13.                 $mat(u) := mat(u) \setminus \{v_i\}$
14.     $C^p_u := MergePredLists(mat(u))$

Procedure 4: PruneUpward

Input: 3-hop index ($L_{in}$, $L_{out}$), the prime subtree $(V_t, E_t)$.
Output: Updated candidate matching nodes satisfying upward
        structural constraints.

1.  $C^s_{u_{root}} := MergeSuccLists(mat(u_{root}))$
2.  $V_t := V_t \setminus \{u_{root}\}$
3.  for each node $u \in V_t$ in top-down order such that $|mat(u)| > 1$ do
4.      for each child $u'$ of $u$ such that $|mat(u')| > 1$ do
5.          for each node $v \in mat(u')$ do
6.              $chain_{v.cid} := chain_{v.cid} \cup \{v\}$
7.              $Group_v := Group_v \cup \{u'\}$
8.      for each node $v_i$ in a nonempty $chain_i$ do
9.          if $mat(u)$ does not reach $v_i$ then  // using Proposition 7
10.             for each $u' \in Group_{v_i}$ do
11.                 $mat(u') := mat(u') \setminus \{v_i\}$
12.         else break
13.     for each non-leaf child $u'$ of $u$ do
14.         $C^s_{u'} := MergeSuccLists(mat(u'))$



Example 8. We show how to compute the predecessor contour of $mat(u_{10})$
for the query $Q$ of Fig. 2. Example 3 has given that
$mat(u_{10}) = \{v_9, v_{10}, v_{13}, v_{15}\}$. The procedure collects the
complete predecessor lists for each node of $mat(u_{10})$ one by one, but no
predecessor list is visited repeatedly. For example, assume that $v_{10}$ is
read before $v_{15}$. When collecting $Y_{v_{15}}$, although $prev(v_{15})$
points to $v_6$, MergePredLists need not look up $L_{in}(v_6)$, because the
list has been looked up when collecting $Y_{v_{10}}$. The predecessor contour
of $mat(u_{10})$ is $\{v_3, v_9, v_{13}, v_{15}\}$. It can be easily verified
that the size of this predecessor contour is half of the total size of the
four individual complete lists of $v_9$, $v_{10}$, $v_{13}$ and $v_{15}$.
Note that the size of a predecessor contour is bounded by the number of
chains. This example actually gives the worst case but still has a high
compression rate (50%). □

Time complexity. The time complexity of the procedure is
$O(|S| + |L_{in}|)$, where $|L_{in}|$ is the total size of all predecessor
lists in the 3-hop index. This follows from the fact that no index node in a
predecessor list is ever visited repeatedly.

Along the same lines as MergePredLists, we develop MergeSuccLists, which
calculates the successor contour of a node set with time complexity
$O(|S| + |L_{out}|)$, where $|L_{out}|$ is the total size of all successor
lists in the 3-hop index.

4.2.2 Pruning process for downward structural con- 
straints 

Procedure 3 describes the first round of the pruning process. In the
procedure, $val$ refers to a valuation for variables associated with query
nodes. PruneDownward first collects $mat(\cdot)$, sorted in descending order
of sequence numbers, for each query node and calculates the predecessor
contours for leaf nodes (lines 1-2). Then, it processes each non-leaf query
node $u$ in a bottom-up fashion (lines 4-14). For each node $u$, it first
groups the nodes of $mat(u)$ by chain ID (line 5). Then for each candidate
matching node $v_i$ of $u$ on each chain $i$, PruneDownward checks whether
$v_i$ satisfies the downward structural constraints (lines 8-13). To do this,
(1) it first assigns a valuation to $p_{u'}$ for each child node $u'$ of $u$
according to the reachability from $v_i$ to $mat(u')$ (lines 9-11), and (2)
then removes $v_i$ from $mat(u)$ if the structural predicate $f_s(u)$ of $u$
evaluates to false under the valuation (lines 12-13). Note that when
processing the next node on the same chain, the valuation for the previous
node is inherited due to the transitivity of reachability along a chain.
Therefore, no predecessor list is repeatedly looked up. After all candidate
matching nodes for $u$ have been processed, the remaining data nodes in
$mat(u)$ must satisfy the downward structural constraints. Then the
predecessor contour for $u$ is computed (line 14) and used in the pruning
process of the parent node of $u$. The procedure terminates after the root is
processed.
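
The bottom-up logic of this pruning round can be sketched as follows. The sketch deliberately simplifies the procedure and should be read under these assumptions: reachability is answered from an explicitly computed transitive closure instead of 3-hop contours, so the per-chain inheritance of valuations is not modeled; structural predicates are Python callables over a valuation keyed by child query node; and the data graph, query and all names are hypothetical.

```python
def descendants(edges):
    """Map each node to the set of nodes reachable by a non-empty path."""
    nodes = set(edges) | {w for ws in edges.values() for w in ws}
    closure = {}
    for s in nodes:
        seen, stack = set(), list(edges.get(s, []))
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(edges.get(n, []))
        closure[s] = seen
    return closure

def prune_downward(root, children, mat, f_s, closure):
    """Remove candidates violating downward constraints; mat is updated."""
    def visit(u):
        kids = children.get(u, [])
        for c in kids:                 # children first: bottom-up order
            visit(c)
        if not kids:
            return
        for v in sorted(mat[u]):
            # val[c] is True iff v reaches some remaining candidate of c
            val = {c: bool(closure[v] & mat[c]) for c in kids}
            if not f_s[u](val):
                mat[u].discard(v)
    visit(root)
    return mat

# Hypothetical data graph and query: u1 has children u2, u3; u2 has child u4.
EDGES = {"x1": ["y1"], "x2": ["y1", "z1"], "x3": ["y2"], "y1": ["w1"]}
children = {"u1": ["u2", "u3"], "u2": ["u4"]}
mat = {"u1": {"x1", "x2", "x3"}, "u2": {"y1", "y2"},
       "u3": {"z1"}, "u4": {"w1"}}
f_s = {"u1": lambda val: val["u2"] and not val["u3"],   # p_u2 AND NOT p_u3
       "u2": lambda val: val["u4"]}                     # p_u4
pruned = prune_downward("u1", children, mat, f_s, descendants(EDGES))
```

In this toy run, `y2` is dropped from the candidates of `u2` first, which in turn causes `x3` to fail at `u1`, while `x2` fails because of the negated variable.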

Example 9. We first show how procedure PruneDownward prunes $mat(u_3)$ of
Fig. 2. In a bottom-up fashion, before pruning $mat(u_3)$, PruneDownward
first processes its non-leaf child $u_7$. No nodes in $mat(u_7)$ (i.e.,
$\{v_6, v_7\}$) are removed, because $v_6$ can reach both $mat(u_9)$ and
$mat(u_{10})$ while $v_7$ can reach $mat(u_{10})$. The predecessor contour
for $mat(u_7)$ is then computed and $C^p_{u_7} = \{v_6, v_7\}$.
For determining whether V5 should be removed from mat{m), 
PruneDownward checks the reachability between v^ and mat (we), 
matiuj), mat{us) respectively by using the predecessor contours. 
One can verify that V5 cannot reach matiua), which means ?;aZ[pug] 
= and the structural predicate /"^(us) evaluates to true. Thus, 
W5 remains in matiu-j,). Because the other two nodes V3, and 
are in different chains, they do not inherit the valuation determined 
by 115 and PruneDownward needs to check pairwise reachability 
between {113, ng} and {matiu^), matiuj), mat{ug,)}. Only ng 
is subsequently removed, because pug = liPug = Puy = 
and /e^t(w3) evaluates to false. Finally, after this pruning round, 
mat{u'j,) = {113, vs}. 

When PruneDownward refines mat{ui) and reads 112, the as- 
signments of and are directly inherited from the result 
computed in the previous step of processing 114 and /J^t(wi) im- 
mediately evaluates to true without any index lookups. 

PruneDownward gets the following refined candidate matching nodes, which
satisfy the downward structural constraints: $mat(u_2) = \{v_3, v_8\}$,
$mat(u_3) = \{v_3, v_5\}$. □

Time complexity. Since no successor list is repeatedly checked, the 3-hop
index is looked up at most $|E_q||L_{out}|$ times, where $|E_q|$ is the
number of edges in the tree pattern. MergePredLists is invoked
$(|V_q| - 1)$ times to compute predecessor contours for each non-root query
node, and the total time cost is $O(|V_{mat}| + |V_q||L_{in}|)$, where
$|V_q|$ is the number of query nodes and $|V_{mat}|$ is the total size of the
initial candidate matching nodes (i.e., $|V_{mat}| = \sum_i |mat(u_i)|$).
Therefore, PruneDownward runs in
$O(|V_q|(|L_{in}| + |L_{out}|) + |V_{mat}|)$ time.

4.2.3 Pruning process for upward structural con- 
straints 

Figure 6: Example of the maximal matching graph for Q over G depicted in
Fig. 2.

After the first-round pruning process, for each backbone node $u$, the
remaining nodes in $mat(u)$ satisfy all the structural constraints imposed by
predicates. Because the results of the query should consist of matches of
output nodes only, the matches for predicate nodes are no longer useful and
do not need to be considered. Moreover, some backbone nodes may not
contribute to determining which candidate matching output nodes are in the
same instance and hence can also be discarded. With these two observations,
the structural constraints of a backbone subtree are enough to derive the
relationships among candidate matching nodes for the output query nodes. Such
a subtree, which we call the prime subtree, is induced by the paths from the
query root to all output nodes with $|mat(\cdot)| > 1$. The next pruning step
only needs to consider this subtree pattern, which in essence is reduced to a
conjunctive GTPQ.
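
As an illustration of how the prime subtree is induced, the following sketch collects the query nodes on the root-to-output paths. It assumes the query tree is given as a child-to-parent map; the tree, the output set and the candidate sets are hypothetical.

```python
def prime_subtree(parent, outputs, mat):
    """Return the query nodes on paths from the root to every output node
    that still has more than one candidate matching node."""
    nodes = set()
    for u in outputs:
        if len(mat[u]) > 1:
            # Climb towards the root; stop early on an already added path.
            while u is not None and u not in nodes:
                nodes.add(u)
                u = parent.get(u)
    return nodes

# Hypothetical query tree: u1 is the root.
parent = {"u1": None, "u2": "u1", "u3": "u1", "u4": "u2", "u7": "u3"}
outputs = {"u2", "u4", "u7"}
mat = {"u1": {"v1", "v2"}, "u2": {"v3", "v8"}, "u3": {"v3", "v5"},
       "u4": {"v9", "v10"}, "u7": {"v6"}}
prime = prime_subtree(parent, outputs, mat)
```

The output node `u7` has a single candidate, so neither it nor its parent `u3` is pulled into the prime subtree.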

In the opposite direction to PruneDownward, procedure PruneUpward (Procedure
4) traverses down the prime subtree. For each query node $u$, it filters the
candidate matching nodes of each child $u'$ of $u$ (lines 3-14). All the
candidate nodes to be processed are first clustered and merged into
duplicate-free sets according to their chain IDs, where the order of nodes is
reversed (lines 4-7). As a data node can match multiple query nodes, the
algorithm uses $Group_v$ to record the corresponding query nodes that $v$
matches (line 7) in order to update $mat(\cdot)$ when a reachability
condition is determined (lines 10-11). Then, for each node
$v_i \in mat(u')$ on a nonempty $chain_i$, $v_i$ should be removed if
$mat(u)$ cannot reach $v_i$ according to Proposition 7. Observe that once a
node is confirmed to satisfy the condition of the incoming edge, all other
larger nodes do not need to be checked since they must also satisfy the
condition.

Example 10. In this example, assume that $u_2$ and $u_3$ are output nodes of
$Q$ of Fig. 2. The prime subtree is induced by $u_1$, $u_2$ and $u_3$.
PruneUpward starts from $u_1$ to refine $mat(u_2)$ and $mat(u_3)$. After
grouping distinct data nodes into chain, it gets
chaini = {wa}, chain3={vs}, and chain4 = {ws}. W3 is in both 
$mat(u_2)$ and $mat(u_3)$, but the procedure only stores one copy in chain to
avoid processing it repeatedly when checking reachability with $mat(u_1)$.
After the two query nodes whose candidate matching nodes both contain $v_3$
are inserted into $Group_{v_3}$, $Group_{v_3} = \{u_2, u_3\}$. Because
$mat(u_1)$ reaches $v_3$, $v_3$ is not removed from either $mat(u_2)$ or
$mat(u_3)$. Similarly, it can be verified that $mat(u_1)$ can reach $v_8$ and
$v_5$. In the end, none is removed from $mat(u_2)$ and $mat(u_3)$ after this
pruning round. □

Time complexity. The time complexity is
$O(|V'_{mat}| + (|L_{in}| + |L_{out}|)|V_I|)$, where $|V_I|$ is the number of
internal nodes in the prime subtree and $|V'_{mat}|$ is the total size of the
remaining candidate matching nodes after the first pruning round.

4.3 Computing Final Results 

Shrunk prime subtree. As a result of the pruning process, the matching output
nodes are guaranteed to be in the answer. What remains is to identify how
they form the final results by computing the matches of edges in the prime
subtree. Given a prime subtree, assume that $u$ is the lowest common ancestor
of all output nodes.



Procedure 5: CollectResults

Input: The maximal matching graph $MaximalGraph$, a query node $u$ and one
       of its candidate matching nodes $v$.
Output: The answer to the subGTPQ rooted at $u$ and dominated by $v$.

1.  if $v$ is a leaf node then return $\{u : v\}$
2.  else
3.      $results := \{()\}$  // the set containing only the empty tuple
4.      for each branch list $bch$ of $v$ (for a child $u'$ of $u$) do
5.          $branchResults := \emptyset$
6.          for each node $v'$ that a pointer in $bch$ points to do
7.              $branchResults := branchResults \cup CollectResults(MaximalGraph, u', v')$
8.          $results := results \times branchResults$
9.      if $u$ is an output node then $results := \{u : v\} \times results$
10.     return $results$



We can further shrink the subtree by (1) removing the ancestors of $u$ if $u$
is not the root, and (2) removing all nodes $u'$ with $|mat(u')| = 1$. If the
removal leads to disjoint subtrees, we compute results for each subtree, take
the Cartesian product of them, and add the candidate matching nodes of the
removed output nodes to assemble the final results. From now on, we only need
to compute edge matches for the shrunk prime subtree(s).

Example 11. The shrunk prime subtree of Q of Fig.[2]is induced 
by U2 and 1*4. Even if we change the query to mark also as 
an output node, the shrunk prime subtree is still the same since 
\mat[uz)\ = |{wi3}| = 1 and vi^ must be in every answer. □ 

Maximal matching graph. The full matches of the shrunk prime subtree can be
represented by a maximal matching graph $Q_g(G) = (V_r, E_r)$, where (1)
$V_r \subseteq V$ such that $v \in V_r$ if there is a query node
$u \in V_q$ such that $v \models u$; (2) $E_r \subseteq V_r \times V_r$ such
that $(v_1, v_2) \in E_r$ if $(v_1, v_2)$ is a match of an edge
$(u_1, u_2) \in E_q$.

We group the nodes and edges in the graph according to what 
query nodes and edges they match. Specifically, in an implemen- 
tation, each node v has several branch lists, each of which corre- 
sponds to the child of the query node that v matches and includes 
pointers pointing to nodes matching the child. 

Example 12. Recall the GTPQ $Q$ and data graph $G$ in Fig. 2. Let $u_2$,
$u_3$ and $u_4$ be output nodes. Fig. 6 shows the corresponding maximal
matching graph. As an example, $v_1$ has two branch lists corresponding to
the two incident query edges, denoted by $bch_1$ and $bch_2$ respectively:
$bch_1 = \{ptr_{v_3}, ptr_{v_8}\}$ and $bch_2 = \{ptr_{v_3}, ptr_{v_5}\}$,
where $ptr_{v_i}$ ($i = 3, 5, 8$) is a pointer to $v_i$. □



Computing the maximal matching graph. Since the nodes of the maximal matching
graph have been obtained after the pruning process, we only need to compute
matches for each query edge whose head and tail both have more than one
matching node. Given a query edge $(u_1, u_2)$, a straightforward way is to
check the reachability between nodes in $mat(u_1)$ and $mat(u_2)$ using the
3-hop index. The time complexity is
$O((|L_{in}| + |L_{out}|)\,|E_q|\,|V_{mat}|_{max}^2)$, with
$|V_{mat}|_{max}$ being the maximal size of the candidate matching nodes
after the pruning process. Since in practice many queries are highly
selective and $|V_{mat}|_{max}$ is presumably small, the straightforward way
is expected to be fast and practical.

A more sophisticated approach, which we choose, is to reuse the technique of
procedure PruneUpward. Observe that the loop from line 9 to 12 in PruneUpward
determines whether a data node matching some child of $u$ is reachable from
$mat(u)$. By replacing the successor contour with the complete successor list
of a node $v$, we can simultaneously get all edges from $v$ in the maximal
matching graph in $O(|L_{in}| + |L_{out}| + |E_v|)$ time, where $|E_v|$ is
the out-degree of $v$ in the resulting graph. The total time complexity then
is $O((|L_{in}| + |L_{out}|)|V^{I}_{mat}| + |E_{mg}|)$, where
$|V^{I}_{mat}|$ is the number of candidate matching nodes for internal query
nodes and $|E_{mg}|$ is the number of edges in the resulting maximal matching
graph.

Enumerating results. We next present Procedure 5, referred to as
CollectResults, which derives final results from the maximal matching graph.
Each result is in a tuple format. To avoid ambiguity in presentation, we
explicitly specify in the tuple which query node a data node matches.
Specifically, each element in a tuple is of the form $u : v$, which means
$v$ is an image of $u$ in a match.

Procedure CollectResults traverses down the maximal matching graph. For a
leaf node, since its corresponding query node must be an output node, the
procedure returns a tuple containing only that node (line 1). For an internal
node, it collects results from each child for every branch list, and then
takes the Cartesian product of them (lines 4-8). If the query node it matches
is an output node, it is inserted into each result (line 9). The final answer
to the query is the union of the results of those nodes matching the query
root. When query nodes in the shrunk prime subtree are all output nodes, no
redundant intermediate results are produced. Note that no existing algorithms
for pattern queries on graphs can achieve this. When there are non-output
query nodes in the shrunk prime subtree, our algorithm is not duplicate free.
Recall Example 12. The results
obtained from vi are the same as those obtained from vs, since ui 
is not an output node and Vi can reach V3. However, the duplicate 
intermediate tuples are a subset of the counterpart of other works, because
(1) the prime subtree we pick is a minimum subtree of the original query
pattern that contains all output nodes, and (2) for non-output nodes, the
algorithm merges the intermediate partial results in advance (line 7).
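
The enumeration can be sketched in Python as a direct recursion over branch lists. This is an illustrative sketch, not the paper's code: the maximal matching graph is assumed to be a dictionary keyed by (query node, data node) pairs whose values map each child query node to a branch list, results are dictionaries from query nodes to data nodes, and the toy graph is hypothetical.

```python
def collect_results(graph, children, outputs, u, v):
    """Return the partial results for the sub-pattern rooted at query node u
    when u is matched to data node v."""
    kids = children.get(u, [])
    if not kids:
        return [{u: v}]            # a leaf of the shrunk subtree is an output
    results = [{}]                 # neutral element of the Cartesian product
    for child in kids:             # one branch list per child query node
        branch_results = []
        for v2 in graph[(u, v)][child]:
            for r in collect_results(graph, children, outputs, child, v2):
                if r not in branch_results:   # merge duplicates early
                    branch_results.append(r)
        results = [{**a, **b} for a in results for b in branch_results]
    if u in outputs:
        results = [{u: v, **r} for r in results]
    return results

# Hypothetical toy graph: u1 is a non-output root with output children.
children = {"u1": ["u2", "u3"]}
outputs = {"u2", "u3"}
graph = {("u1", "v1"): {"u2": ["v3", "v8"], "u3": ["v3", "v5"]}}
answers = collect_results(graph, children, outputs, "u1", "v1")
```

Because the root is not an output node here, calling the function for two different root matches with identical branch lists would yield the same tuples twice, which is the duplication discussed above.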

Remark. In practical languages, there are also group operations that require
grouping the results. We can easily adapt our algorithm to support the group
operator. In CollectResults, when $u$ is a group node, the result returned
for $v$ is a tuple containing $v$ and a special group element, which is the
set of matches of the subtree dominated by $v$. That is, in line 9,
$results := \{u : v, (results)\}$.

4.4 Evaluating Queries with PC Edges 

In the context of graph databases, research on pattern queries often focuses
on reachability patterns. Indeed, the reachability pattern query is an
important building block for other queries. Adding PC edges to a pattern
significantly increases the complexity of evaluation. Even for
tree-structured data, [25] has theoretically demonstrated the difficulty of
handling TPQs with arbitrary combinations of PC and AD edges: it proved that
no holistic algorithm can achieve the optimality that is attainable for
queries with AD edges only. For graph-structured data, the evaluation of
conjunctive pattern queries whose edges all represent PC relationships is
essentially a computationally hard labeled graph isomorphism problem.
Nevertheless, we can use a similar idea within our framework to support GTPQs
with PC edges.

When processing a node $u$ in PruneDownward: (1) if it has only
PC outgoing edges, we merge the sets of parents of the nodes in
$mat(u')$ for each child $u'$ of $u$ into $P_{u'}$, instead of
computing the predecessor contours. Then we sort $mat(u)$ and each
$P_{u'}$, and determine for each node $v$ in $mat(u)$ which
$P_{u'}$ contain it, in a multiway merge-sort style. If $v$ is in
$P_{u'}$, then $val[p_{u'}] := 1$; otherwise $val[p_{u'}] := 0$. (2)
If $u$ has both AD and PC edges, we process these two types of edges
separately to refine $mat(u)$. Similarly, when performing
PruneUpward, we collect the sets of children of $mat(u)$ instead of
computing the successor contour.
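
The PC-only case of PruneDownward can be illustrated with a short Python sketch. It assumes data nodes are comparable integers and that a hypothetical `parents` dictionary maps each data node to its parents; the function names are illustrative and do not come from the paper.

```python
def parents_of(matches, parents):
    """Sorted, duplicate-free list of the parents of all nodes in `matches`."""
    merged = set()
    for node in matches:
        merged.update(parents.get(node, ()))
    return sorted(merged)


def pc_valuation(mat_u, child_parent_lists):
    """For each v in mat(u), set val[u'] = 1 iff v occurs in the sorted list P_{u'}.

    All lists are scanned once, in lockstep with the sorted mat(u),
    in the style of a multiway merge.
    """
    cursors = {child: 0 for child in child_parent_lists}
    valuation = {}
    for v in sorted(mat_u):
        row = {}
        for child, plist in child_parent_lists.items():
            i = cursors[child]
            # Advance the cursor past entries smaller than v.
            while i < len(plist) and plist[i] < v:
                i += 1
            cursors[child] = i
            row[child] = 1 if i < len(plist) and plist[i] == v else 0
        valuation[v] = row
    return valuation


# Toy data: nodes 10 and 11 match child b, node 20 matches child c.
parents = {10: [1, 2], 11: [2], 20: [3]}
lists = {"b": parents_of([10, 11], parents), "c": parents_of([20], parents)}
val = pc_valuation([1, 2, 3, 4], lists)
```

The resulting valuation for each data node can then be plugged into the structural predicate of $u$ to decide whether the node stays in $mat(u)$.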

After the pruning stage, all candidate matching nodes are
guaranteed to be in the final results. To compute the maximal
matching graph, we can either perform nested joins to check the
adjacency relationships or perform a multiway merge-join to derive
the adjacent edges in the resulting graph. Other operations,
including determining the prime subtree and enumerating the final
results, are the same.

Alternatively, we can use another strategy to deal with PC edges.
Regarding a PC edge as a special type of AD edge, we can first
process PC edges in the same way as AD edges during pruning, except
for those whose tail's structural variable is the operand of a
negation operator, which need to be processed as stated before. The
prime subtree becomes a minimum subtree that contains all output
nodes and those PC edges that are regarded as AD edges when pruning.
After computing the maximal matching graph, we check whether the two
incident nodes of each corresponding edge in the maximal matching
graph are adjacent in the data graph and remove the edge if not.
Next, the unsatisfied nodes are removed in a top-down fashion,
followed by enumerating the final results. We use this strategy in
our implementation.
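
The adjacency check of this second strategy might look as follows in Python. The sketch covers only the edge-verification step (not the subsequent top-down removal of unsatisfied nodes), and the data layout (`match_edges`, `successors`) is an assumption made for illustration.

```python
def verify_pc_edges(match_edges, pc_query_edges, successors):
    """Drop matching-graph edges of PC query edges that are not data-graph edges.

    match_edges: query edge (u, u') -> set of data node pairs (v, v')
    pc_query_edges: set of query edges that are PC edges
    successors: data node -> set of its direct successors in the data graph
    """
    verified = {}
    for q_edge, pairs in match_edges.items():
        if q_edge in pc_query_edges:
            # A PC edge requires direct adjacency, not mere reachability.
            pairs = {(v, w) for (v, w) in pairs if w in successors.get(v, ())}
        verified[q_edge] = set(pairs)
    return verified


match_edges = {
    ("a", "b"): {("a1", "b1"), ("a1", "b2")},  # PC edge, pruned as if it were AD
    ("b", "c"): {("b1", "c1")},                # genuine AD edge, left untouched
}
successors = {"a1": {"b1"}, "b1": set()}
checked = verify_pc_edges(match_edges, {("a", "b")}, successors)
```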

5. EXPERIMENTAL EVALUATION 

In this section, we present an experimental study using both real- 
life and synthetic data to evaluate (1) the efficiency and scalability 
of our algorithm, (2) the effectiveness of representing intermediate 
results as graphs, and (3) the efficiency of the pruning process. 

We only give the experimental results for conjunctive TPQs in which
all query nodes are output nodes (i.e., traditional TPQs). We found
that our algorithm outperforms other algorithms even on these
queries. No other algorithm has been designed for GTPQs, and the
decomposition-based approach that may be applied on top of existing
algorithms to process GTPQs incurs high overhead, as analyzed in the
related work and empirically demonstrated in prior studies [16] and
[29]. We therefore expect the advantage of our algorithm on general
GTPQs to be even larger than the results reported here. Additional
experimental results concerning I/O cost and the results on GTPQs
with disjunctive and negative predicates can be found in the
Appendix.

Implementation. We have implemented the algorithm proposed in
Section 4 (GTEA), TwigStack [3], Twig²Stack [7], TwigStackD [6]
and HGJoin [27]. TwigStack is the classical holistic twig join
algorithm. Twig²Stack is the latest algorithm for evaluating TPQs
on tree-structured data; its distinctive feature is that it
represents results in hierarchical stacks. Other algorithms for
tree-structured data that can support disjunction and/or negation,
such as BTwigMerge [4] and TwigStackList¬ [29], are in essence the
same as TwigStack with respect to conjunctive TPQs and hence are not
included in our experiments. TwigStackD can evaluate conjunctive
TPQs over graph-structured data. In our implementation, we fixed the
problems in the original paper [30]. HGJoin is a hash-based
structural join algorithm for processing graph pattern queries. We
did not implement the query plan generation of the original
algorithm, which relies on selectivity estimation techniques [22]
and takes exponential time in the query size; instead, for each
query, we generated all valid plans and evaluated each of them. The
minimum query processing time, obtained on the best plan, is
reported; thus, the time presented in this paper is always smaller
than the actual time of the original HGJoin. This version is denoted
by HGJoin+. By representing intermediate results as graphs, we have
also implemented another version, denoted by HGJoin*. All
experiments were performed on a 2.4 GHz Intel Core i3 CPU with
3.7 GB RAM.

5.1 On XMark Data 

In this set of experiments, we use large synthetic XMark data
[24] to evaluate the efficiency and scalability of various algorithms.
As mentioned in Section 1, many graph-structured XML databases



Table 1: Statistics of XMark datasets

Scaling factor      0.5    1      1.5    2      4
Dataset size (MB)   55     111    167    223    447
Nodes (Million)     0.64   1.29   1.94   2.52   5.17
Edges (Million)     0.77   1.54   2.32   3.09   6.20

Table 2: The average size of query results on XMark

Queries   55M     111M    167M     223M     447M
Q1        368     762.8   1115.8   1496.8   2986.8
Q2        34.6    75.8    117.8    150.3    297.2
Q3        1.9     4.1     5.8      6.1      17.1

can be modeled by a special form of graphs consisting of trees
connected by cross edges (ID/IDREF links). In this case, we can use
existing twig join algorithms to process conjunctive TPQs by
decomposing them into a set of subqueries on separate trees. We
use TwigStack and Twig²Stack to investigate the efficiency of
applying this approach.

Datasets. We generated five XMark datasets with scaling factors
from 0.5 to 4. For each dataset, we generated a graph in which
nodes correspond to XML elements and edges represent the internal
(parent-child) links and the ID/IDREF links. The attribute of a
graph node is the tag of its element, except for nodes corresponding
to person and item elements: for each of these two types, we
randomly classify the nodes into ten groups to represent different
properties. A label is assigned to each node according to its tag or
the group it belongs to. Distinct labels indicate different
attribute values. The details of the generated documents and graphs
are presented in Table 1.
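
A minimal Python sketch of this graph construction is given below. It is an illustration under stated assumptions rather than the generator used in the experiments: it assumes that identifiers are stored in an `id` attribute and that any other attribute whose value equals a known identifier is an IDREF; the function name and label format are hypothetical.

```python
import random
import xml.etree.ElementTree as ET

def xml_to_graph(xml_text, grouped_tags=("person", "item"), groups=10, seed=0):
    """Turn an XML document into (labels, edges) of a labeled directed graph."""
    rng = random.Random(seed)
    elements = list(ET.fromstring(xml_text).iter())  # document order
    node_id = {id(e): n for n, e in enumerate(elements)}
    by_xml_id = {e.get("id"): node_id[id(e)]
                 for e in elements if e.get("id") is not None}
    labels, edges = {}, []
    for e in elements:
        n = node_id[id(e)]
        if e.tag in grouped_tags:
            # Randomly assign person/item nodes to one of `groups` groups.
            labels[n] = f"{e.tag}#{rng.randrange(groups)}"
        else:
            labels[n] = e.tag
        for child in e:
            edges.append((n, node_id[id(child)]))   # parent-child link
        for name, value in e.attrib.items():
            if name != "id" and value in by_xml_id:
                edges.append((n, by_xml_id[value]))  # ID/IDREF cross edge
    return labels, edges


doc = ('<site><people><person id="p0"/></people>'
       '<open_auctions><open_auction id="a0"><seller person="p0"/>'
       '</open_auction></open_auctions></site>')
labels, edges = xml_to_graph(doc)
```

In this toy document the `seller` element yields a cross edge to the `person` node in addition to the five parent-child edges.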

Queries. The three types of queries we used in the experiments are
depicted in Fig. 7, where dotted edges refer to ID/IDREF links in
the original data. For each query type, we generated ten queries
by randomly choosing a label for each of the person and item nodes,
each label representing a different attribute predicate. The average
is reported.

Experimental results. Fig. 8(a) shows the query evaluation time
for Q1 on datasets of varying size. The results for Q2 and Q3 are
quite similar. The results reveal the following. (1) GTEA
consistently outperforms all other algorithms. Specifically, GTEA
is three times to more than one order of magnitude faster than
TwigStack and Twig²Stack, five times to more than two orders of
magnitude faster than HGJoin, and in the best cases three times
faster than TwigStackD. As the data size grows, the performance
gain of GTEA becomes more significant. (2) TwigStackD also performs
very well in this set of experiments, for the following reasons.
(a) It utilizes SSPI, a reachability index with a small size and
good query time for tree-like graphs. (b) Its basic idea is
extended from the holistic twig join algorithms, so TwigStackD also
enjoys the advantages of the stack encoding and the blocking method
for path results [3]. (c) Although TwigStackD has to buffer every
node in pools (a special structure used to store nodes popped from
stacks) and has to perform a large number of operations that check
edge conditions against all nodes in pools (indicated as reasons for
inefficiency in [27] and [11]), its pre-filtering process can filter
out redundant nodes and reduce the cost of these operations. Indeed,
without the pre-filtering process, TwigStackD is slower by orders of
magnitude [30]. (3) It is somewhat surprising that TwigStack
performs slightly better than Twig²Stack. The reason is that
although Twig²Stack avoids generating path matches (a primary
reason for its efficiency in [7]), the overhead of merging stack
trees and maintaining the hierarchical structures outweighs the
benefits in these experiments. The small depth of XMark graphs (5 on
average) also prevents the hierarchical stack encoding from having a
strong advantage. Besides, the enumeration of path matches (a reason
for the inefficiency of TwigStack in [7]) can be done quickly using
the blocking technique. (4) HGJoin has the worst performance, mainly
because (a) the structural-join approach has to generate a large
number of (largely redundant) intermediate results for small
substructures and (b) non-trivial merge-join operations on them have
to be performed even with the best plan. Its query processing time
increases significantly as the size of the data graph increases.

Figure 7: Queries for XMark data
Figure 8: Performance results on XMark data

Fig. 8(b) shows the results on the XMark dataset of scale 0.5 for
different queries. (1) The query processing time of GTEA remains
nearly the same as the query size increases. In particular, the
time cost for evaluating Q2 is smaller than that for Q1. This is
because the size of the results of Q2 is much smaller than that of
Q1, as presented in Table 2, resulting in a smaller cost for
enumerating the final results. (2) The processing time of TwigStack
and Twig²Stack does not increase significantly over Q1, Q2 and Q3,
although they have to evaluate an increasing number of subqueries
and perform a growing number of merge operations. Indeed, as shown
in Table 2, the sizes of the results of Q1 and Q2, which are
subqueries of Q2 and Q3 respectively, are small, and thus the extra
cost for evaluating Q2 and Q3 is very limited. (3) However, HGJoin
is much more sensitive to the increase of the query size, which is
due to the impact of the redundant intermediate results and the
expensive sort operations involved in performing multi-structural
joins. The results for HGJoin highlight the crucial importance of
using a pruning process to reduce the number of intermediate results
that do not contribute to the answer.

5.2 On arXiv Data 

In this set of experiments, we used a real-life graph to evaluate
the performance of GTEA, TwigStackD and HGJoin on general graph
data, and to verify the effectiveness of the graph representation
of intermediate results and the efficiency of the pruning process.

Dataset. We generated a graph from the HEP-Th database, originally
derived from arXiv. There are paper nodes and author nodes, each
associated with multiple properties. For simplicity, we assigned a
label to each author node according to the email domain, and a label
to each paper node based on its area and the journal it is published
in, to represent the attributes. The edges of the graph represent
authorship or citation relationships. The graph has 9562 nodes,
28120 edges, and 1132 distinct labels.

Figure 9: Performance results on arXiv data. (a) Distribution of the
result sizes. (b) Query processing time on the queries with small
sizes of results. (c) Query processing time on the queries with large
sizes of results. (d) Comparison with the pre-filtering process.

Query generator. We designed a query generator to randomly
produce meaningful queries. Each query node is associated with a
label randomly chosen from the data graph to represent attribute
predicates. Two groups of queries are generated: one has a small
result size, between 2 and 50; the other has a large result size,
between 200 and 1200. For each group, five sets of queries were
generated with query sizes varying from 5 to 13. We generated
fifteen different queries for each size scale and report the
average. The average time reflects the average-case performance of
each algorithm, since the queries are generated randomly. The
results for queries of distinct sizes in the same group are
comparable, because the differences in the result sizes of the
queries have little impact on the query processing time and the
numbers of query results for each size scale follow a close
distribution, as illustrated in Fig. 9(a).

Experimental results. Fig. 9(b) and (c) report the results for the
two groups of queries. They tell us the following. (1) GTEA has
the best query processing time, significantly smaller than the
processing time of the other algorithms (by more than one order of
magnitude in most cases). It also has the best scalability in both
groups of experiments. (2) TwigStackD no longer performs as well as
on XMark data. In fact, it has the longest query time for queries of
size 5 to 9. The arXiv graph is much denser and deeper than XMark
data, which makes the pool structure as well as SSPI inefficient.
The problem of TwigStackD is highlighted by Fig. 9(c), where its
time fluctuates sharply for queries with large results. The results
reflect that TwigStackD has rather poor performance for particular
queries. In contrast, GTEA is the most robust, since it maintains
good performance in all experiments. (3) HGJoin+ does not scale
well, similar to its behavior on the XMark data. Yet it now
performs better than TwigStackD when the query size is smaller than
11. (4) The revised HGJoin (i.e., HGJoin*) has better scalability
than HGJoin+. For the group of queries with large results, the query
processing time of HGJoin* is smaller than that of HGJoin+ when the
query size is larger than 7, compared to 11 for the group of queries
with small results. This observation demonstrates that the graph
representation of intermediate results can improve the performance
and achieve better scalability, especially when there are many
intermediate or final results and when the query size is large. The
revised version takes more time than the original one on queries of
small sizes because HGJoin* incurs costs for dynamically and
recursively deleting unqualified nodes (a cost that does not exist
in our algorithm), which offset the benefits of avoiding merge-join
operations on tuples.

Fig. 9(d) compares the efficiency of our pruning process with that
of the pre-filtering algorithm in TwigStackD. It clearly shows that
our pruning method greatly outperforms its counterpart and also
scales better with the query size. This is because the pre-filtering
algorithm in TwigStackD requires two traversals of the data graph.

6. CONCLUSIONS 

We have proposed the GTPQ, a new class of tree pattern queries
on graph-structured data, which incorporates structural predicates
defined in terms of propositional logic to specify structural
conditions. We studied several fundamental problems, and established
a general framework for evaluating GTPQs using a graph
representation of intermediate results and a pruning approach. An
algorithm has been developed for evaluating GTPQs, which achieves a
small size of intermediate results thanks to the effective pruning
process and largely avoids generating redundant matches by
dynamically shrinking the tree pattern during the pruning and
enumeration processes.

Acknowledgement. This work is supported by the National Science
Foundation of China (61075074).
APPENDIX

A. XQUERY EXAMPLE 

Q1 in Example 1 can be expressed in XQuery:

let $dblp := doc("dblp.xml")
for $paper in $dblp//inproceedings,
    $conf in $dblp//proceedings
where $paper/author = "Alice" and $paper/author = "Bob" and
      $paper/crossref = $conf/@key and data($conf/year) > 2000 and
      data($conf/year) < 2010
return
  if (exists($paper/year) and exists($conf/title))
  then <paper>
         <title>{$paper/title}</title>
         <year>{$paper/year}</year>
         <conf>{$conf/title}</conf>
       </paper>
  else ()

B. PROOFS 

Proof Sketch of Theorem 1. Given a GTPQ $Q$, we can safely remove
two kinds of nodes, together with their descendants, without
changing the satisfiability: the nodes whose attribute predicates
are unsatisfiable and the non-independently constraint nodes. We
therefore consider only the case where no such nodes exist. We prove
that $Q$ is satisfiable iff, for the root node $u_r$, $f_{cs}(u_r)$
is satisfiable.

(1) $\Rightarrow$: Suppose $G$ is a data graph such that $Q(G)$ is
non-empty. Let $C$ be a certificate and $T$ be the corresponding
truth assignment on the variables in structural predicates: for a
query node $u$, if there exists a data node $v$ such that $v \in C$
and $v \models u$, then $p_u^T := 1$; otherwise $p_u^T := 0$.

By the definition of the semantics, if $p_u^T = 1$, then
$f_{ext}(u)^T = 1$; thus, $f_{ext}(u_r)^T = 1$. For each clause
$(\neg p_{u_1} \vee (p_{u_2} \wedge f_{ext}(u_2)))$ in $f_{cs}(u_r)$,
because $u_2 < u_1$, $p_{u_1} \rightarrow p_{u_2}$ and
$f_{ext}(u_1) \rightarrow f_{ext}(u_2)$ hold; thus, the clause is
true under $T$. Therefore, $f_{cs}(u_r)^T = 1$.

(2) $\Leftarrow$: Suppose $T$ is a satisfying truth assignment of
$f_{cs}(u_r)$. We initialize a data graph $G = (V, E, f)$ as follows.

(a) For each variable $p_{u_i}$ in $f_{cs}(u_r)$ such that
$(p_{u_i} \wedge f_{ext}(u_i))^T = 1$, add a node $v_i$ to $G$.

(b) Add an edge $(v_i, v_j)$ to $G$ iff $(u_i, u_j)$ is an edge in $Q$.

(c) For each node $v_i$, choose $f(v_i)$ such that $f(v_i)$ satisfies
$f_a(u_i)$.

We simulate the process of evaluating $Q$ on $G$ and denote the
truth assignment in the evaluation by $T'$. We assign a truth value
to each node variable in a bottom-up process according to the
semantics of GTPQs and at the same time modify $G$ if necessary to
make $V$ a certificate.

For any query node $u_i$, if $p_{u_i}^{T} = 0$ and
$p_{u_i}^{T'} = 1$, it can be inferred that there exists $v_j \in G$
such that $f(v_j)$ satisfies $f_a(u_i)$. If $f_a(u_j) \neq f_a(u_i)$,
we change $f(v_j)$ so that $f(v_j)$ satisfies $f_a(u_j)$ but does not
satisfy $f_a(u_i)$, leading to $p_{u_i}^{T'} = 0$.

We next prove by contradiction that after the above processing, if
$p_u^{T} = 1$, then $p_u^{T'} = 1$. Assume $u$ is a node at the
largest depth for which $p_u^{T} = 1$ and $p_u^{T'} = 0$. By
assumption, for any descendant $u_d$ of $u$, if $p_{u_d}^{T} = 1$,
then $p_{u_d}^{T'} = 1$. So there must be a child $u'$ of $u$ for
which $p_{u'}^{T} = 0$ and $p_{u'}^{T'} = 1$. From the way $G$ is
constructed, there is a mapping from $u'$ and its descendants to
another descendant $u''$ of $u$ and its descendants such that
$u' < u''$ and $p_{u''}^{T} = 1$. However, since
$f_{cs}(u_r)^{T} = 1$, if $p_{u''}^{T} = 1$, then $p_{u'}^{T} = 1$,
which contradicts our assumption.

For each backbone node $u$,
$(p_u \wedge f_{ext}(u))^{T'} = (p_u \wedge f_{ext}(u))^{T} = 1$.
So each output node has a non-empty image in $V$, and those images
constitute an answer to $Q$. □

Proof Sketch of Theorem 2. Since attribute predicates are
conjunctive, their satisfiability can be determined in linear time.
We assume in the following that all attribute predicates are
satisfiable.

(1) A union-conjunctive GTPQ where all attribute predicates are
satisfiable is always satisfiable.

(2) We prove that the satisfiability problem of a general GTPQ is
NP-complete by a reduction from SAT.

Given any instance $\phi$ of SAT, we suppose $\phi$ has $n$ variables
and construct a GTPQ $Q$ with $n + 1$ nodes as follows. (a) First,
choose the first $n$ nodes, each $v_i$ corresponding to a distinct
variable $x_i$ in $\phi$. Then, construct an edge from the
$(n + 1)$-th node to each of them. (b) Each node is associated with
a satisfiable attribute predicate with a distinct attribute variable.
The structural predicate of the root is $\phi$, with $p_{v_i}$
replacing $x_i$ for each non-root node $v_i$. (c) The root node,
denoted by $u_r$, is the only output node.

Since $f_{cs}(u_r) = f_s(u_r) = \phi$, $f_{cs}(u_r)$ is satisfiable
iff $\phi$ is satisfiable. By Theorem 1, the conclusion that $Q$ is
satisfiable iff $\phi$ is satisfiable immediately follows.

It is easy to check that the reduction takes linear time and that
the satisfiability problem is in NP. □
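
The construction used in this reduction can be illustrated with a small Python sketch. The dictionary representation of a query and both function names are assumptions made for the example, and the brute-force check is exponential; it only demonstrates that the root predicate of the constructed query is satisfiable exactly when the input formula is.

```python
from itertools import product

def gtpq_from_formula(variables, phi):
    """Build the (n + 1)-node query of the reduction.

    variables: names x_1..x_n; phi: callable mapping {variable: bool} -> bool.
    """
    root = "u_r"
    leaf_of = {x: f"v_{x}" for x in variables}
    return {
        "root": root,
        "edges": [(root, leaf) for leaf in leaf_of.values()],
        "output": [root],
        "leaves": list(leaf_of.values()),
        # Structural predicate of the root: phi with p_{v_i} in place of x_i.
        "f_s": lambda val: phi({x: val[leaf] for x, leaf in leaf_of.items()}),
    }


def root_predicate_satisfiable(query):
    """Try every truth assignment of the leaf variables (exponential)."""
    leaves = query["leaves"]
    for bits in product([False, True], repeat=len(leaves)):
        if query["f_s"](dict(zip(leaves, bits))):
            return True
    return False


sat_query = gtpq_from_formula(["x1", "x2"], lambda a: (a["x1"] or a["x2"]) and not a["x1"])
unsat_query = gtpq_from_formula(["x1"], lambda a: a["x1"] and not a["x1"])
```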

Proof Sketch of Theorem 3.

(1) $\Rightarrow$. According to the truth table of the complete
structural predicate of the root node of $Q_1$, we can enumerate all
(potentially exponentially many) combinations of query nodes of
$Q_1$ such that for each combination, there exists a bijection $A$
from a certificate to it. Informally, for each combination viewed as
a GTPQ, we can construct a data graph $G$ from a satisfying truth
assignment in the way used in the proof of Theorem 1, so that the
data nodes constitute a certificate. By assumption, $G$ is also a
certificate with respect to $Q_2$, and there is a mapping $A'$ from
$Q_2$ to the certificate. Further, $A' \circ A^{-1}$ is a mapping
from $Q_2$ to $Q_1$ satisfying the first three conditions in the
definition of homomorphism. Finally, a homomorphism can be derived
from all such mappings with respect to the combinations.

(2) $\Leftarrow$. For the opposite direction, suppose there is a
homomorphism $A$ from $Q_2$ to $Q_1$. Let $G$ be a data graph on
which the answer of $Q_1$ is not empty. Suppose $res \in Q_1(G)$ and
$C$ is a corresponding certificate with the truth assignment denoted
by $T$.
It is clear that C is a certificate of Q2 with a truth assignment T' 
such that (a) $p_u^{T'} = 1$ iff $p_{A(u)}^{T} = 1$, and (b)
$f_{cs}(u_r)^{T'} = 1$ for the root $u_r$. □

Proof Sketch of Theorem^ The proof is based on a reduction from 
the tautology checking problem (TCP) of propositional formulas
to the containment problem of GTPQs by constructing a GTPQ
from an instance of TCP using the same technique as in the proof of
Theorem 2. □

Proof Sketch of Theorem^ The proof is based on a reduction from 
the variable minimal equivalence problem (VME) [10] in
propositional logic to the decision version of the
minimization problem of GTPQs by constructing a GTPQ from an
instance of VME using the same technique in the proof of Theorem 
H □ 

C. ADDITIONAL EXPERIMENTAL RESULTS

C.1 Measuring I/O Cost

We measure the I/O cost of each algorithm in terms of three met- 
rics, namely the number of data nodes accessed (#input), the num- 
ber of index elements looked up (#index), and the size of interme- 
diate results (#intermediate_results). 

Regarding the number of index lookups, the value for GTEA is the
total number of elements retrieved from the successor and
predecessor lists in the 3-hop index; the value for HGJoin is the
total number of ids and interval labels in tag lists (called Alist
and Dlist in [27]); the value for TwigStackD is the total number of
surrogate and surplus predecessors visited in SSPI. Since TwigStack
and Twig²Stack do not use a graph reachability index, they have no
such cost.

The cost of intermediate results for each algorithm is computed
as follows. (1) The value for GTEA is twice the total number of the
nodes and edges of the maximal matching graph. (2) The values
for HGJoin, TwigStack and Twig²Stack include the cost of
intermediate results for subqueries in the form of tuples. (3) In
addition, TwigStack and Twig²Stack also incur the space cost of
stack encoding. (4) Apart from the cost of stack encoding,
TwigStackD introduces the space cost of pool encoding. It is
necessary to clarify that in our experiments, all intermediate
results are maintained in main memory and not stored on disk. This
metric evaluates the worst-case I/O cost caused by the intermediate
results. When measuring this cost, we assume that every intermediate
result is written to disk and read back to main memory when needed.




Figure 10: I/O cost (panels: #input, #intermediate_results, #index)

Fig. 10 depicts the experimental results for processing Q3 on the
XMark dataset with scale factor 1.5. The detailed costs are reported
above the columns. Note that TwigStack and Twig²Stack incur
exactly the same I/O cost.

According to the results, TwigStack and Twig²Stack read the
smallest number of data nodes. They only need to scan the data nodes
corresponding to all query nodes once. In comparison, GTEA
accesses more, because it needs to perform a two-round pruning
process (bottom-up and top-down). The value, however, is bounded
by twice that of TwigStack. As HGJoin splits a query into
subtree queries and different subqueries share query nodes,
HGJoin also accesses some data nodes more than once, with a bound of
the maximum number of children of any node in the tree pattern.
TwigStackD reads far more data nodes than the others in the
experiments, as a result of the two traversals of the data graph in
the pre-filtering process.

The results clearly show that GTEA creates far fewer intermediate
results than the other four algorithms. TwigStack and Twig²Stack
have more intermediate results than GTEA by four orders of
magnitude. The huge gap results from the fact that TwigStack and
Twig²Stack need to output a large number of intermediate path and
twig solutions to each subtree query, which is far less selective
than the whole query. The structural joins adopted by HGJoin also
introduce many partial solutions and lead to a large size of
intermediate results, as shown in the figure. For TwigStackD, its
pre-filtering process selects nodes potentially in the final answers
and considerably saves the space cost of stacks and pools. GTEA
shows the best performance, as it can prune non-answer nodes as
TwigStackD does and represent the intermediate results as a maximal
matching graph.

Table 4: The structural predicates of the queries in Exp-2
Figure 11: The tree structure of tested queries

Table 3: The output nodes of the queries in Exp-1

open_auction
open_auction, bidder, seller
open_auction, bidder, seller, city, profile
open_auction, item, location
all query nodes

Fig. 10 shows that GTEA again outperforms HGJoin, due to the
compact 3-hop index and the effectiveness of the merging operations
in the pruning process. Yet GTEA incurs more cost for looking up
indexes than TwigStackD. GTEA uses the 3-hop index in the two-round
pruning process and when constructing the maximal matching graph,
while TwigStackD looks up the reachability index only when expanding
the partial solutions in pools. However, the small cost achieved by
TwigStackD comes at the expense of the large I/O cost for scanning
data nodes in the pre-filtering process, which significantly reduces
the number of nodes to be processed in the stacks and pools.
Moreover, since indexes are often kept mostly in main memory, the
difference in the number of disk I/Os needed for GTEA and TwigStackD
to support the index lookups is expected to be small in practice.

Overall, GTEA achieves a good performance gain over its competitors
in terms of I/O cost. The results indicate that the pruning process
does not incur an I/O cost as high as that of TwigStackD, and that
the graph representation keeps the space cost of intermediate
results small.

C.2 GTPQ Processing 

In this section, we present the experimental results for GTPQs
with the same structure (Fig. 11) on the XMark dataset with scale
factor 4. Since HGJoin and TwigStackD need to perform the same
decompose-and-merge operations to process GTPQs, and our
experiments on conjunctive queries have shown that TwigStackD
significantly outperforms HGJoin, we did not include HGJoin in this
set of experiments. Twig²Stack was also excluded, as its
performance is comparable to that of TwigStack and the
post-processing on top of the two algorithms for GTPQs is the same.

Exp-1 Optimization for non-output nodes. We first compare GTEA,
TwigStack and TwigStackD for processing conjunctive queries with a
varying number of output nodes. The output nodes for each tested
query are given in Table 3. The result sizes of those queries are
presented in Table 5. Because TwigStack and TwigStackD are not
optimized for queries with non-output nodes and the differences in
the result sizes of the tested queries are small, the processing
times on different queries are close to each other for both
algorithms. Fig. 12(a) depicts the results of GTEA only. Recall that
GTEA uses a prime subtree obtained based on the output nodes and the
specific matching nodes in procedure PruneDownward and when
constructing the maximal matching graph, to avoid creating useless
matches to non-output nodes. Hence, the processing time of GTEA
depends on the structure of the prime subtree and the size of the
final answers. The results show that the fewer output nodes there
are, the less processing time the evaluation generally takes.

Table 5: Numbers of query results

Exp-2 GTPQ processing. We next show the experimental results
for queries that may contain negation and disjunction. Three classes
of tested queries, namely the queries with disjunction only (DIS),
those with negation only (NEG) and those with both disjunction and
negation (DIS_NEG), are shown in Table 4. All potentially valid
backbone nodes are set as output nodes for all queries. Fig. 12(b),
(c) and (d) depict the results for the tested GTPQs. All of them
consistently confirm the significant performance gain of GTEA (from
several times to three orders of magnitude). Indeed, as mentioned
in the related work, TwigStack and TwigStackD need to process a
number of small subqueries and perform expensive post merge-join
operations for processing GTPQs. It is non-trivial to fine-tune the
two algorithms for GTPQs. It may be possible to derive an
efficient mechanism that makes TwigStack and TwigStackD output their
intermediate results in sorted order so that the merge-join
operations cost less. However, it is difficult to reduce the
large size of intermediate results, which considerably impairs the
efficiency of TwigStack and TwigStackD, so they are unlikely to
outperform our algorithm in any case.

D. PROCESSING QUERIES WITH MULTIPLE OUTPUT STRUCTURES

GTEA can be straightforwardly extended to process queries whose
output is not restricted to backbone nodes. The only modification is
in procedure CollectResults. For an internal node in the maximal
matching graph, instead of taking one Cartesian product of the
results of the branches, the procedure may perform several Cartesian
products of the results of different branches, depending on the
specified result structures. Take the query DIS1 as an example (the
query structure is shown in Fig. 11 and the predicates are defined
in Table 4), and suppose that the results of the query should be of
the form (open_auction, bidder, item) or (open_auction, seller,
item). The (shrunk) prime subtree is constructed by treating bidder
and seller as the originally defined backbone nodes. In the maximal
matching graph, for each matching node of open_auction, the
CollectResults procedure performs two Cartesian products to derive
the answers: one product of the two branch results corresponding to
bidder and item, and the other product of the two branch results
corresponding to seller and item.

Figure 12: GTPQ Processing
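
The per-structure products can be sketched in Python as follows. This is a simplified illustration for a single matching node, assuming each branch has already been reduced to a list of partial tuples; the function name and data layout are hypothetical.

```python
from itertools import product

def collect_for_structures(node, branch_results, structures):
    """Perform one Cartesian product per requested output structure.

    branch_results: branch name -> list of partial result tuples
    structures: list of branch-name lists, one per output structure
    """
    answers = []
    for branch_names in structures:
        lists = [branch_results[name] for name in branch_names]
        for combo in product(*lists):
            # Flatten the chosen partial tuples behind the matching node.
            answers.append((node,) + tuple(x for part in combo for x in part))
    return answers


branch_results = {
    "bidder": [("b1",), ("b2",)],
    "seller": [("s1",)],
    "item": [("i1",)],
}
answers = collect_for_structures(
    "oa1", branch_results, [["bidder", "item"], ["seller", "item"]]
)
```

For the toy input, the first product contributes the two (open_auction, bidder, item) answers and the second contributes the single (open_auction, seller, item) answer.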

E. EXPANDED ALGORITHMS 

We show procedures PruneDownward and PruneUpward in more
detail in Procedure 6 and Procedure 7.

Procedure 6: PruneDownward

Input: 3-hop index L_out, a GTPQ Q
Output: Candidate matching nodes satisfying downward structural
constraints.

for each node u ∈ V_Q do mat(u) := {x | x ∈ V, x ∼ u}
for each leaf node u' in V_Q do C_{u'} := MergePredLists(mat(u'))
V'_Q := V_Q \ {u' | u' is a leaf node}
for each u ∈ V'_Q in bottom-up order do
    for each v ∈ mat(u) do chain_{v.cid} := chain_{v.cid} ∪ {v}
    for each chain_i that is not empty do
        for each child u' of u do val[p_{u'}] := 0
        for each node v_i ∈ chain_i do
            for each child u' of u s.t. val[p_{u'}] = 0 do
                if C_{u'}[i] > v_i.sid then val[p_{u'}] := 1
            v'_i := v_i
            repeat
                for each index node v'' ∈ L_out(v'_i) do
                    for each child u' of u s.t. val[p_{u'}] = 0 do
                        if C_{u'}[v''.cid] > v''.sid then
                            val[p_{u'}] := 1
                v'_i := next(v'_i)
            until v'_i = null or visited_i < v'_i.sid
            if f_s(u) evaluates to false with the valuation val then
                mat(u) := mat(u) \ {v_i}
            visited_i := v_i.sid
    C_u := MergePredLists(mat(u))



Procedure 7: PruneUpward

Input: 3-hop index L_in, the prime subtree (V'_Q, E'_Q) of a GTPQ
Output: Candidate matching nodes satisfying upward structural
constraints

C_{u_root} := MergeSuccLists(mat(u_root))
V''_Q := V'_Q \ {u' | u' is a leaf node}
for each node u ∈ V''_Q from top to bottom such that |mat(u)| > 1 do
    for each child u' of u such that |mat(u')| > 1 do
        for each node v ∈ mat(u') do
            chain^{u'}_{v.cid} := chain^{u'}_{v.cid} ∪ {v}
            Group_v := Group_v ∪ {u'}
    merge all lists chain^{u'}_i (u' is a child of u) into chain_i
        for each chain i
    for each chain_i that is nonempty do
        for each node v_i ∈ chain_i do
            if C_u[i] < v_i.sid then reach := true; break
            v'_i := v_i
            repeat
                for each index node v'' ∈ L_in(v'_i) do
                    if C_u[v''.cid] < v''.sid then
                        reach := true; break
                if reach = true then break
                v'_i := prev(v'_i)
            until v'_i = null or visited_i > v'_i.sid
            if reach = false then
                for each u' ∈ Group_{v_i} do
                    mat[u'] := mat[u'] \ {v_i}
            else break
            visited_i := v_i.sid
    for each non-leaf child u' of u do
        C_{u'} := MergeSuccLists(mat(u'))



